ABSTRACT


This project attempts to find the optimum neural network configuration,
using the MATLAB Toolbox, for some of the benchmark problems. These
problems are considered difficult to solve using standard ANN techniques.
Besides these problems, some complicated functions have also been
considered and an attempt has been made to solve them. All the problems
considered in this work are typical in the sense that they capture the
extremities of most of the parameters. The problems considered fall into the
categories of Classification and Function Approximation.

Codes developed in JAVA were used to solve the same problems. The
results thus obtained were compared with those obtained using the
MATLAB Toolbox. The codes developed in JAVA have undergone
refinement and may be used for modeling Neural Networks for real-life
problems.
Chapter 1


Introduction

Several attempts have been reported to understand and model the capabilities of the human
brain. Some of these are Expert Systems, Genetic Algorithms, Artificial Neural Networks, Fuzzy
Logic, etc. These algorithms represent different levels of human information processing. It
is believed that the human brain has an exceptional capability to recall, optimize, memorize,
sort and search. However, each model, taken independently, may not perform all the functions
performed by the brain. In the present work the computational power of Artificial Neural
Networks has been explored.

1.1 Artificial Neural Networks 


An ANN consists of many simple elements called neurons. The neurons interact with
each other using weighted connections, similar to biological neurons. Inputs to an artificial
neural net are multiplied by the corresponding weights. All the weighted inputs are then
aggregated and subjected to non-linear filtering to determine the state or activation level
of the neuron.
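
The weighting, aggregation and non-linear filtering described above can be sketched in a few lines of Python. This is only an illustration: the function name `neuron_output`, the example values, and the choice of a logistic sigmoid as the non-linear filter are assumptions and are not taken from the codes developed in this work.

```python
import math

def neuron_output(inputs, weights):
    """Weighted sum of the inputs followed by a logistic non-linearity.

    Hypothetical helper: the logistic function is an assumed choice of filter.
    """
    net = sum(w * x for w, x in zip(weights, inputs))  # aggregation of weighted inputs
    return 1.0 / (1.0 + math.exp(-net))                # non-linear filtering

# Hypothetical example values
print(neuron_output([1.0, 0.5], [0.4, -0.8]))  # net = 0.0, so the output is 0.5
```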

Neurons are generally configured in a regular and highly interconnected topology in an
ANN. For example, in the Hopfield model, neurons form a fully connected topology, and the
output from each neuron feeds as an input to the neighboring neurons. In the
Backpropagation model and the Boltzmann machine, the networks consist of one or more
layers between the input and output layers. In the self-organizing feature map, the networks
connect a vector of input neurons to a two-dimensional grid of output neurons. There is
no clear-cut methodology to decide the parameters, topologies and method of training of an
ANN. Hence, building an ANN is time consuming and computer intensive. However,
ANNs can be used in real time because of their inherent parallelism and noise immunity
characteristics.

1.2 Models of Artificial Neural Network


Neural networks can be defined as an interconnection of neurons such that neuron
outputs are connected, through weights, to all other neurons including themselves; both
lag-free and delay connections are permitted. The networks can be broadly classified as:

(i) Feed-forward Network

(ii) Feedback Network

1.2.1 Feed-forward Network: Consider an elementary feed-forward architecture of
m neurons receiving n inputs, as shown in Figure 1.1.

Figure 1.1 : A feed-forward network

m = number of neurons,
n = number of inputs

Its input and output vectors are given by:

$$ \mathbf{x} = [x_1, x_2, \ldots, x_n]^T, \qquad \mathbf{O} = [O_1, O_2, \ldots, O_m]^T \tag{1.1} $$

$w_{ij}$ = weight between the i-th neuron and the j-th input.
The activation value for the i-th neuron can be written as

$$ net_i = \sum_{j=1}^{n} w_{ij} x_j, \quad i = 1, 2, \ldots, m \tag{1.2} $$

The non-linear transformation that follows, performed by each of the m neurons, is

$$ O_i = f(\mathbf{w}_i^T \mathbf{x}), \quad i = 1, 2, \ldots, m \tag{1.3} $$

with

$$ \mathbf{w}_i = [w_{i1}, w_{i2}, \ldots, w_{in}]^T \tag{1.4} $$

With Γ denoting a non-linear matrix operator, the mapping of the input space x to the
output space O implemented by the network can be expressed as follows:

$$ \mathbf{O} = \Gamma[W\mathbf{x}] \tag{1.5} $$

W = weight matrix or connection matrix, and

$$ W \triangleq \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & \vdots & & \vdots \\ w_{m1} & w_{m2} & \cdots & w_{mn} \end{bmatrix} \tag{1.6a} $$

$$ \Gamma[\cdot] \triangleq \begin{bmatrix} f(\cdot) & 0 & \cdots & 0 \\ 0 & f(\cdot) & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & f(\cdot) \end{bmatrix} \tag{1.6b} $$

f(·) = non-linear activation function. The functions lying on the diagonal of Γ operate
component-wise on the activation values (net) of each neuron.

x and O are often called input and output patterns. The mapping of an input pattern into
an output pattern is of the feed-forward and instantaneous type, since it involves no time
delays between the input x and the output O:

$$ \mathbf{O}(t) = \Gamma[W\mathbf{x}(t)] \tag{1.7} $$
Example 1.1: Two-layer feed-forward network using neurons having the bipolar binary
activation function

$$ f(net) \triangleq \operatorname{sgn}(net) = \begin{cases} +1 & \text{for } net > 0 \\ -1 & \text{for } net < 0 \end{cases} $$

The network to be analyzed is shown below:

[Figure: two-layer network, with signals labelled input of layer 1, output of layer 1, input of layer 2, output of layer 2]

Figure: Two-Dimensional Space Mapping of the Network

Our purpose is to find the output $O_5$ for a given network and input patterns. Each layer of
the network is described by the formula

$$ \mathbf{O} = \Gamma[W\mathbf{x}] $$

Consider the first layer:

$$ \mathbf{O} = [O_1, O_2, O_3, O_4]^T, \qquad \mathbf{x} = [x_1, x_2, -1]^T, \qquad W_1 = \begin{bmatrix} 1 & 0 & 1 \\ -1 & 0 & -2 \\ 0 & 1 & 0 \\ 0 & -1 & -3 \end{bmatrix} $$

Similarly, for the second layer:

$$ \mathbf{O} = [O_5], \qquad \mathbf{x} = [O_1, O_2, O_3, O_4, -1]^T, \qquad W_2 = [1 \;\; 1 \;\; 1 \;\; 1 \;\; 3.5] $$

The response of the first layer (for the bipolar binary activation function) can be given as:

$$ \mathbf{O} = [\operatorname{sgn}(x_1 - 1), \; \operatorname{sgn}(-x_1 + 2), \; \operatorname{sgn}(x_2), \; \operatorname{sgn}(-x_2 + 3)]^T $$

The mapping performed by the first layer is described next.

Each of the neurons 1 through 4 divides the plane $x_1$, $x_2$ into two half-planes. The
half-planes where the neurons' responses are positive (+1) have been marked on figure 1.3
with arrows pointing towards the positive-response half-plane. The response of the second
layer can be easily obtained as:

$$ O_5 = \operatorname{sgn}(O_1 + O_2 + O_3 + O_4 - 3.5) $$

$$ O_5 = +1 \quad \text{iff} \quad O_1 = O_2 = O_3 = O_4 = +1 $$

It therefore selects the intersection of the four half-planes produced by the first layer and
designated by the arrows.
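
The mapping of Example 1.1 can be checked numerically with a short Python sketch. The weight matrices are those of the example; the function names `sgn`, `layer` and `network`, the test points, and the treatment of net = 0 (mapped to -1, a case the definition of sgn leaves open) are assumptions made only for illustration.

```python
def sgn(net):
    """Bipolar binary activation; net == 0 is mapped to -1 (an assumption)."""
    return 1 if net > 0 else -1

# Weight matrices of Example 1.1; the last column multiplies the fixed input -1.
W1 = [[1, 0, 1],
      [-1, 0, -2],
      [0, 1, 0],
      [0, -1, -3]]
W2 = [[1, 1, 1, 1, 3.5]]

def layer(W, x):
    """One feed-forward layer, O = Gamma[W x], with sgn on the diagonal of Gamma."""
    return [sgn(sum(w * xi for w, xi in zip(row, x))) for row in W]

def network(x1, x2):
    """Return O5 for the input point (x1, x2)."""
    first = layer(W1, [x1, x2, -1])      # outputs O1..O4
    return layer(W2, first + [-1])[0]    # output O5

print(network(1.5, 1.0))  # point with 1 < x1 < 2 and 0 < x2 < 3
print(network(0.0, 1.0))  # point with x1 < 1
```

Only points lying in all four positive half-planes, that is 1 < x1 < 2 and 0 < x2 < 3, give O5 = +1.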
Mapping Provided by the Same Architecture with Neurons Having Sigmoidal
Characteristics


For the continuous bipolar activation function

$$ f(net) = \frac{2}{1 + \exp(-\lambda \, net)} - 1 $$

the output of the first layer is given by

$$ \mathbf{O} = \begin{bmatrix} \dfrac{2}{1 + \exp((1 - x_1)\lambda)} - 1 \\ \dfrac{2}{1 + \exp((x_1 - 2)\lambda)} - 1 \\ \dfrac{2}{1 + \exp(-x_2 \lambda)} - 1 \\ \dfrac{2}{1 + \exp((x_2 - 3)\lambda)} - 1 \end{bmatrix} \tag{1.8} $$

For the second layer, it is

$$ O_5 = \frac{2}{1 + \exp((3.5 - O_1 - O_2 - O_3 - O_4)\lambda)} - 1 \tag{1.9} $$

The network with neurons having a sigmoidal activation function can perform two-dimensional
space mapping. The network considered, with bipolar continuous neurons,
maps the entire $x_1$, $x_2$ plane into the interval (-1, 1) on the real number
axis. In fact, neural networks with as few as two layers are capable of universal approximation from
one finite-dimensional space to another.
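
A small Python sketch of the same network with continuous bipolar neurons is given below. The names `bipolar_sigmoid` and `network`, the test points and the values of the steepness parameter `lam` are illustrative assumptions; the expressions follow the same architecture and weights as Example 1.1.

```python
import math

def bipolar_sigmoid(net, lam=1.0):
    """Continuous bipolar activation: 2 / (1 + exp(-lam * net)) - 1."""
    return 2.0 / (1.0 + math.exp(-lam * net)) - 1.0

def network(x1, x2, lam=1.0):
    """Two-layer network with the weights of Example 1.1 and sigmoidal neurons."""
    first = [bipolar_sigmoid(x1 - 1.0, lam),
             bipolar_sigmoid(2.0 - x1, lam),
             bipolar_sigmoid(x2, lam),
             bipolar_sigmoid(3.0 - x2, lam)]
    return bipolar_sigmoid(sum(first) - 3.5, lam)

# Hypothetical test points; every output lies strictly inside (-1, 1).
for lam in (1.0, 50.0):
    print(lam, network(1.5, 1.0, lam), network(0.0, 1.0, lam))
```

For a large steepness the continuous neurons approach the hard-limiting response of the bipolar binary neurons, so the output tends towards +1 inside the rectangle and -1 outside it.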

1.2.2 Feedback Network: By connecting the neurons' outputs to their inputs we can
convert the feed-forward network into a feedback network, as shown in figure 1.4. The feedback
loop controls the output $O_i$ through the outputs $O_j$, j = 1, 2, ..., m. Here the present output
O(t) controls the output at the following instant, O(t + Δ). The time Δ elapsed between t and
t + Δ is introduced by the delay elements in the feedback loop.

Figure 1.4: A feedback network

$$ \mathbf{O}(t + \Delta) = \Gamma[W\mathbf{O}(t)] \tag{1.10} $$

The input x(t) is only needed to initialize this network so that O(0) = x(0). The input is
then removed and the system remains autonomous for t > 0.

There are two main categories of single-layer feedback networks. If we consider time as
a discrete variable and decide to observe the network performance at the discrete time
instants Δ, 2Δ, 3Δ, ..., the system is called discrete-time, where

Δ = unit delay

t = 0 is the initial state

For a discrete-time artificial neural system,

$$ \mathbf{O}^{k+1} = \Gamma[W\mathbf{O}^{k}], \quad k = 1, 2, \ldots \tag{1.11a} $$

$$ \mathbf{O}^{1} = \Gamma[W\mathbf{x}^{0}], \quad \ldots, \quad \mathbf{O}^{k+1} = \Gamma[W\Gamma[\cdots\Gamma[W\mathbf{x}^{0}]\cdots]] \tag{1.11b} $$

The above network is called recurrent since its response at the (k+1)-th instant depends on the
entire history of the network starting at k = 0. Recurrent networks of this kind operate with a discrete
representation of data; they employ neurons with a hard-limiting activation function.
However, if the delay is infinitesimal, then the output vector may be assumed to be
a continuous-time function.
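
The discrete-time recursion of a single-layer feedback network can be sketched in Python as follows. The weight matrix, the initial state, the function names and the rule that a neuron keeps its previous output when its net input is exactly zero are all hypothetical choices made for this illustration.

```python
def hard_limit(net, previous):
    """Hard-limiting activation; keeps the previous output when net == 0 (an assumption)."""
    if net > 0:
        return 1
    if net < 0:
        return -1
    return previous

def step(W, o):
    """One synchronous update: the next output is Gamma[W o]."""
    return [hard_limit(sum(w * oj for w, oj in zip(row, o)), o[i])
            for i, row in enumerate(W)]

def run(W, x0, max_steps=10):
    """Initialise with O(0) = x(0), then iterate autonomously until the state repeats."""
    history = [list(x0)]
    for _ in range(max_steps):
        nxt = step(W, history[-1])
        history.append(nxt)
        if nxt == history[-2]:
            break
    return history

# Hypothetical symmetric weight matrix with zero diagonal
W = [[0, 1, 1],
     [1, 0, 1],
     [1, 1, 0]]
print(run(W, [1, -1, 1]))
```

The input is used only to set the initial state; every later state is computed from the previous output alone.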

1.3 Knowledge Representation 

The primary characteristics of knowledge representation are twofold: (1) what 
information is actually made explicit; and (2) how the information is physically encoded 
for subsequent use. A major task for a neural network is to learn a model of the world in 
which it is embedded and to maintain the model sufficiently consistent with the real 
world so as to achieve the specified goals of the application of interest. In a neural 
network, the subject of knowledge representation is very complex, since there are
multiple sources of information that activate the network and these sources interact
among themselves. However, there are four rules for
knowledge representation that are of a general common-sense nature. These four rules are:

Rule 1. Similar inputs from similar classes should generally produce similar
representations inside the network, and should therefore be classified as belonging to the
same category.

Rule 2. Items to be categorized as separate classes should be given widely different
representations in the network.

Rule 3. If a particular feature is important, then there should be a large number of neurons
involved in the representation of that item in the network.

Rule 4. Prior information and invariances should be built into the design of a neural
network, thereby simplifying the network design by not having to learn them.

1.4 Artificial Intelligence and Neural Network 

The aim of artificial intelligence (AI) is the development of algorithms for tasks that
require cognition when performed by humans. This statement on AI is adapted from
Sage (1990). An AI system must be capable of doing three things: (1) store knowledge;
(2) apply the knowledge stored to solve problems; and (3) acquire new knowledge
through experience. An AI system has three key components: representation, reasoning
and learning.

1.4.1 Representation: AI uses the language of symbol structures to represent both
general knowledge about a problem of interest and specific knowledge about the solution
to the problem. The symbols are formulated in familiar terms, making the symbolic
representations of AI relatively easy for a human user to understand. “Knowledge,” in the
case of AI, is basically data. It may be of a declarative or a procedural kind. In a declarative
representation, knowledge is a static collection of facts, with a set of procedures used to
manipulate these facts. In a procedural representation, knowledge is embodied in
executable code that acts out the meaning of the knowledge. Both kinds of
representation are needed in most problems.

1.4.2 Reasoning: Reasoning is the ability to solve problems. A system must satisfy certain
conditions to be classified as a reasoning system (Fischler and Firschein, 1987):

• The system must be able to express and solve a wide range of problems.

• The system should be able to make explicit any implicit information available to
it.

• The system must have a control mechanism that indicates which operation
should be applied to a particular problem, when a solution to the problem has
been found, or when further work on the problem should be terminated.

Problem solving can be thought of as a search problem. Search is carried out using rules,
data, and control. The rules operate on the data and the control operates on the rules
(Nilsson, 1980). In many problems the available knowledge is either incomplete or
cannot be extracted. In such cases, probabilistic reasoning procedures are used, thereby
permitting AI systems to take the uncertainty of the problem into account.

1.4.3 Learning: As far as machine learning is concerned, we can broadly classify it into
three types:

i. Inductive

ii. Deductive

iii. Abductive

These are three ways in which the available information may be processed. Inductive
processing deals with determining general patterns, organizational schemes, rules and
laws from raw data, experience or examples. Inductive computations perform
abstractions, producing generalities from specifics. The creation of models and theories
is basically inductive work. Determining specific facts using general rules, on the other hand, is
deductive information processing. The determination of new general rules from old
ones is also called deductive. For example, a general rule is that the area of a circle is “pi times
the square of the radius”; from this we deduce that a circle of radius 5 cm has an area of
(π · 25) cm². Proof of theorems is also a deductive process, as it requires rules of inference and
previous theorems. When learning is concerned with the enlargement of a knowledge base,
one generally asks these questions: “Is the new knowledge put in directly?”, “Is it
induced from examples or experience?” or “Is it deduced from existing knowledge?”



Figure 1.5 : Simple model of machine learning 
1.5 AI versus Neural Network

The comparison between these two may be carried out under three sub-divisions 
as mentioned below: 

(i) Level of Explanation 

(ii) Style of Processing 

(iii) Representational Structure 

1.5.1 Level of Explanation: In the case of AI the emphasis is on symbolic
representation. The representations are discrete and arbitrary: abstract properties, and not
analog images. AI assumes the existence of mental representations, and it models cognition
as the sequential processing of symbolic representations (Newell and Simon, 1972). In
the case of neural networks the assumptions made regarding the cognitive process are
entirely different from those in classical AI. The emphasis in neural networks is on the
development of parallel distributed processing (PDP) models. These models assume that
information processing takes place through the interaction of a large number of neurons, each
of which sends excitatory and inhibitory signals to other neurons in the network
(Rumelhart and McClelland, 1986).

1.5.2 Processing Style: In classical AI, the processing is sequential: the operations
are performed in a step-by-step manner. Neural networks, in contrast, employ parallel
processing, which is a source of flexibility and robustness. In a
neural network the computation is spread over a large number of neurons, which is
the reason why it can also recognize noisy or incomplete inputs. A damaged network may
still be able to function satisfactorily, and the learning does not have to be perfect. Thus,
parallel distributed processing approximates the flexibility of a continuous system, in
sharp contrast with the rigid and brittle discrete symbolic AI.

1.5.3 Representational Structure: With a language of thought pursued as a model for
classical AI, symbolic representations possess a quasi-linguistic structure. The
expressions of classical AI are generally complex, built in a systematic fashion from
simple symbols. Given a limited stock of symbols, new expressions are composed by
virtue of the compositionality of symbolic expressions and the analogy between syntactic
structure and semantics.

The nature and structure of representations, in the case of neural network, is a 
crucial problem. In a neural network, representations are distributed. It does not, 
however, follow that whatever is distributed must have constituents, and being distributed 
is very different from having semantic or syntactic constituent structure (Fodor and 
Pylyshyn, 1988). Unfortunately, most of the neural network models proposed to date for 
distributed structure representations are rather ad hoc; they solve the problem for a 
particular class in a way that cannot be extended easily. 

To sum up, symbolic AI may be described as the formal manipulation of a language of
algorithms and data representations in a top-down fashion. Neural networks, on the other hand,
may be described as parallel distributed processors with a natural learning
capability, which usually operate in a bottom-up fashion. Thus a better approach to
implementing a cognitive task would be one that combines both these methods.

1.6 Objective 

The objective of this thesis is to study the performance of Artificial Neural
Networks designed using the MATLAB Toolbox on typical and general benchmark
problems, and to compare it with the performance of networks designed using codes
written in the JAVA language. The problems considered cover most aspects of neural
network theory and belong to two categories: Classification
and Function Approximation. This thesis has been organized in six chapters:
Chapter 1: Introduction and Problem Definition
Chapter 2: Feed-forward ANN Models
Chapter 3: The Back-propagation Algorithm
Chapter 4: Learning
Chapter 5: Benchmark Problems
Chapter 6: Results and Discussions
Chapter 2

Feedforward ANN Models


Most methods of training multi-layered Artificial Neural Networks (ANN) are
based on the steepest descent technique, which frequently has problems such as very
poor convergence behavior, trapping in local minima, misdirection of descent, and
oscillation. Furthermore, the ANN suffers from problems like network paralysis,
overgeneralization, and multiple solutions. The objective of this chapter
is therefore to discuss methodologies to overcome these problems. Development of an ANN
is performed in two phases:

1. Training phase: In this phase the ANN tries to memorize the patterns in the learning data
set. This phase consists of the following modules:

• Selection of Neuron Characteristics 

• Selection of Topology 

• Error Minimization Process 

• Selection of Training Pattern and Preprocessing 

• Stopping Criteria of Training 

2. Testing phase: Here the ANN is used to predict the outputs for the test data sets.


2.1 Training Phase Issues 

2.1.1 Selection of Neuron Characteristics 

Neurons can be characterized by two operations: aggregation and activation.
The non-linear filtering (also called the activation function) can be realized by
several functions, the sigmoidal and hyperbolic tangent functions being the most common.
Both have similar transfer characteristics and hence similar mapping properties. Various other
activation functions are the linear, signum (thresholding function), logarithmic, sinusoidal,
etc. Even though numerous activation functions have been reported in the literature, care
should be exercised in selecting the activation function. Selection of the function is
problem dependent. For example, the logarithmic activation is used when the upper limit
is unbounded, while the Radial Basis Function (RBF) is used for problems having
complex boundaries. The main objective is to design an ANN with superior generalization
ability and a smaller number of nodes. This avoids unnecessary calculations and allows the ANN to be
used for fast decision-making.

A certain class of problems can be handled more suitably by fuzzy networks. In
fuzzy networks some nodes are made up of membership functions of the Π and S types. The
selection of the membership function is again problem dependent and can be handled
accordingly.

Since most ANNs use the gradient method to minimize error, points where
the gradients are small should be avoided at the beginning of the process (small gradients
make the learning process slow). The learning process should therefore be initialized so that the
central part of the function, where the gradients are large, is selected. This can be achieved
by proper scaling of the input signals and by the selection of the limits between which random
weights are generated. For example, if the input is scaled between 0.1 and
0.9, then random weights between -0.5 and +0.5 are preferred for the standard sigmoidal
function. A major drawback of these activation functions is their saturation, which can
cause significant errors when the output has no upper bound.
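
The scaling and weight initialization mentioned above can be illustrated with a short Python sketch. The ranges 0.1 to 0.9 and -0.5 to +0.5 come from the text; the function names `scale` and `init_weights`, the fixed seed and the sample data are assumptions made for the example.

```python
import random

def scale(values, lo=0.1, hi=0.9):
    """Linearly scale a list of raw inputs into the interval [lo, hi]."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # All inputs identical: place them at the centre of the interval.
        return [(lo + hi) / 2.0 for _ in values]
    return [lo + (hi - lo) * (v - v_min) / span for v in values]

def init_weights(n_out, n_in, limit=0.5, seed=0):
    """Random weights drawn uniformly from [-limit, +limit]."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return [[rng.uniform(-limit, limit) for _ in range(n_in)]
            for _ in range(n_out)]

# Hypothetical raw input values
print(scale([0.0, 5.0, 10.0]))
print(init_weights(2, 3))
```

With inputs and weights in these ranges the net input of each neuron starts near the centre of the sigmoid, where the slope is largest.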

2.1.2 Selection of Topology

The topology of an ANN deals with the number of neurons in each layer and their interconnections.
Too few hidden neurons hinder the learning process, while too many occasionally
degrade the ANN's generalization capability. There are no clear-cut or absolute guidelines
available in the literature for deciding the topology of an ANN. However, numerous
researchers have evolved their own rules of thumb and heuristics.

One rough guideline for choosing the number of hidden neurons is the geometric
pyramid rule. It states that, for many practical networks, the number of neurons in
successive layers follows a pyramidal shape, decreasing from the input towards the output. The
optimal number of hidden neurons can be determined by a rigorous search. The
number of neurons is fixed in the input and output layers; therefore, at the start
the ANN is trained and tested with the least number of hidden neurons. If the net cannot be
trained to an acceptable error, the number of neurons in the hidden layers is increased gradually until
the error is acceptable. The rigorous search method, shown in figure 2.1, is time
consuming but reliable.

Pruning can reduce the time complexity of the topology decision. This is done by starting with a
large ANN architecture, which gives poorer generalization. Later,
some parts of the network are pruned to improve generalization and sensitivity.
There are various algorithms available in the literature for this.
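
The rigorous search over the number of hidden neurons can be sketched as a simple loop in Python. The function `search_hidden_neurons` and its parameters are hypothetical; `train_and_test` stands for whatever routine trains a network with a given number of hidden neurons and returns its error, and the stand-in used in the example is not a real training run.

```python
def search_hidden_neurons(train_and_test, max_hidden, tolerance, start=1):
    """Grow the hidden layer until the error returned by train_and_test is acceptable.

    train_and_test: callable taking the number of hidden neurons and returning an error.
    Returns (n_hidden, error), or (None, None) if no size up to max_hidden is acceptable.
    """
    for n_hidden in range(start, max_hidden + 1):
        error = train_and_test(n_hidden)
        if error <= tolerance:
            return n_hidden, error
    return None, None

# Stand-in for a real training run: the error simply shrinks as neurons are added.
def fake_training(n_hidden):
    return 1.0 / n_hidden

print(search_hidden_neurons(fake_training, max_hidden=10, tolerance=0.25))
```

Each call to `train_and_test` is a full training run, which is why the search is reliable but time consuming.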

2.1.3 Error Minimization Procedures

When an ANN is trained, the weights of the links are adjusted so as to achieve
minimum error. During this process a part of the entire data set is used as the training set, and
the error is minimized on this set. The error may be calculated using various
functions, the most commonly used being the Root Mean Square (RMS) function, which is
given by:

$$ \text{Error} = \frac{1}{2}\sqrt{\sum (\text{actual} - \text{predicted})^{2}} \tag{2.1} $$

Equation 2.1 can be generalized as:

$$ \text{Error} = \frac{1}{2}\sqrt{\sum (\text{actual} - \text{predicted})^{k}} \tag{2.2} $$

where k = order of the norm.

We can also use an error function such as

$$ \text{Error} = \sum \left[ \text{actual} \cdot \log(\text{predicted}) + (1 - \text{actual}) \cdot \log(1 - \text{predicted}) \right] \tag{2.3} $$

This error function may increase the learning rate of the ANN and its generalization
capability.
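
The error functions of equations 2.1 and 2.3 can be evaluated with the Python sketch below. The function names and sample values are illustrative assumptions. Note one further assumption: the logarithmic error is written here with a leading minus sign, as in the conventional cross-entropy form, so that a smaller value means a better fit.

```python
import math

def rms_error(actual, predicted):
    """Equation 2.1: half the square root of the summed squared differences."""
    return 0.5 * math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)))

def log_error(actual, predicted):
    """Logarithmic error in the spirit of equation 2.3.

    The leading minus sign is an assumption (conventional cross-entropy form);
    predicted values must lie strictly between 0 and 1.
    """
    return -sum(a * math.log(p) + (1 - a) * math.log(1 - p)
                for a, p in zip(actual, predicted))

# Hypothetical targets and network outputs
actual = [1.0, 0.0]
predicted = [0.5, 0.5]
print(rms_error(actual, predicted))
print(log_error(actual, predicted))
```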

Generally, techniques based on calculus are used to minimize error. These
methods can converge very poorly, as they may get trapped in local minima, a
drawback that can be eliminated by using stochastic search techniques such as Simulated
Annealing and Genetic Algorithms. Unfortunately, these techniques are not suitable for
fast training.

Different strategies are used for updating the weights of an ANN. If the weights are
updated after presentation of the entire set of training patterns to the network, it is
called batch training. In the second approach, the weights are updated after presentation of
each training pattern; this is called pattern training or online updating. In another approach,
the frequency with which a given training sample is used for updating the weights also depends on its
error.

The learning process may be split into two groups: active and passive. In
passive learning, all available examples are fed to the ANN for learning, whereas in
active learning this is not so. In active learning the ANN is trained with
the patterns corresponding to maximum error. After the network has been trained, it is tested
on the remaining examples. The samples giving the worst error are then added to the
training set. This process is repeated until the ANN performs well on the remaining data
set. Thus cross validation is done automatically in this case.

Training of an ANN is influenced by the selection of different parameters. The learning rate and
momentum rate influence the speed of learning in gradient descent techniques.
Increasing the learning rate may improve the speed of training but may also lead to
oscillations, whereas a low learning rate may cause slow learning but will be stable.
The scaling parameter (gain), in the case of the sigmoidal function, also affects the learning
behavior.
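
The influence of the learning rate and momentum can be seen on a toy problem. The Python sketch below minimizes the one-dimensional error E(w) = w², whose gradient is 2w; this error function, the function name `descend` and the parameter values are assumptions chosen only to illustrate the behavior described above.

```python
def descend(lr, momentum=0.0, w=1.0, steps=50):
    """Gradient descent with momentum on the toy error E(w) = w**2."""
    delta = 0.0
    for _ in range(steps):
        grad = 2.0 * w                         # dE/dw
        delta = -lr * grad + momentum * delta  # momentum reuses the previous step
        w += delta
    return w

print(descend(lr=0.1))                # small rate: steady approach to the minimum at 0
print(descend(lr=0.1, momentum=0.5))  # same rate with a momentum term
print(descend(lr=1.1))                # rate too large: growing oscillation
```

With the plain update each step multiplies w by (1 - 2·lr), so a rate above 1 on this toy problem makes the weight change sign and grow at every step.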

2.1.4 Selection of Training Pattern and Preprocessing

Selection of the training data is critical for building an ANN. Although preprocessing is
not mandatory, it definitely improves the performance of the network and reduces the
learning time. Some commonly used preprocessing methods are scaling, normalization,
and noise reduction. In the case of pattern recognition problems, noise is purposely
added (as preprocessing) to improve the recognition capability of the network.

Several other transformations, like linear and log scaling of the training data, are
required for good prediction. If the distribution of a variable is unusual, it may be more
difficult for the ANN to learn to use it, even if the variable is linearly scaled to a
reasonable range. This is because the information content in the variable is too distorted.
Small but important variations of the variable may be compressed into a relatively narrow
area, while other variations may be spread out over a wider range than is required. Such
data sets require non-linear scaling. The distribution of the variables must be examined
before and after the transformation. Preprocessing definitely gives better
results and is recommended as a routine step.

2.1.5 Termination Criteria for Training

The process of adjusting the weights of an ANN is repeated until the termination condition is
met. The training process may be terminated if any one of these conditions is met:

• The error goes below a specific value

• The magnitude of the gradient falls below a certain value

• A specific number of iterations is complete

Network training can also be terminated by the cross validation technique, as shown in
figure 2.2. In this technique the data set is divided into a training set and a validation set.
The training set is used to modify the connection weights and the validation set is used to
estimate the generalization ability. Training is stopped when the error on the validation
set begins to rise. This method may not be practical for a small data set.

The first three criteria are sensitive to the choice of parameters and may lead to
poor results if the parameters are not properly selected. Cross validation does not
have this drawback. It can also avoid over-fitting of the data, as shown in figure 2.4,
and it actually improves the generalization performance of the network. However, this
technique is time intensive.
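
Stopping on the validation error can be sketched in Python as follows. The function `early_stopping_epoch`, its `patience` parameter and the list of validation errors are hypothetical; with `patience=1` training stops at the first epoch in which the validation error fails to improve on the best value seen so far.

```python
def early_stopping_epoch(validation_errors, patience=1):
    """Return (best_epoch, best_error) for a sequence of per-epoch validation errors.

    Scanning stops once the error has not improved for `patience` epochs
    after the best epoch (patience is an assumed tuning parameter).
    """
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(validation_errors):
        if err < best_error:
            best_epoch, best_error = epoch, err
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_error

# Hypothetical validation errors: they fall, then begin to rise.
errors = [0.90, 0.60, 0.40, 0.35, 0.38, 0.45]
print(early_stopping_epoch(errors))  # (3, 0.35)
```

The weights saved at the best epoch would be the ones kept, since later epochs only fit the training set more closely.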

2.2 Testing Phase Issues 

The performance of an ANN on the testing data set represents its generalization ability. In
actual practice the data set is divided into two: one set is used for training and the other
for testing. A well-generalized neural network will perform well on both the training and
testing data. Testing must be done for interpolation as well as for extrapolation, although it
has been found that the performance of an ANN for extrapolation is generally poor. During
the process of testing, some of the weights of the ANN are removed to check its stability. If
the performance of the ANN does not decline, it implies that the ANN will be stable in
case of a small hardware/software failure. Testing an ANN only with undistorted testing
data is not sufficient; a small amount of noise must be added to the data to check the
stability of the network.
Figure 2.1 : Method for Deciding the Topology of ANN

Figure 2.2 : An Example of Validation of Data (vertical axis: Error)
Chapter 3


The Backpropagation Algorithm

The names Error Back-propagation, Multi-layer Feed-forward and Backprop are all used
to refer to the back-propagation algorithm, which is applied to networks such as the one shown in
figure 3.1. It is the most important
algorithm for the supervised learning of multi-layer feed-forward ANNs. Here the error
signals are propagated backwards through the network on a layer-by-layer basis.

The backprop algorithm is based on the selection of a suitable error function or cost
function, whose values are determined by the actual and the desired outputs of the
network and which also depends on the network parameters such as weights and
thresholds. The basic idea is that the cost function defines a surface over the weight
space, and therefore an iterative process such as the gradient descent method can be used
for its minimization. The method of gradient descent is based on the fact that the
gradient of a function always points in the direction of maximum increase of the function;
moving in the direction of the negative gradient therefore induces a maximal “downhill”
movement that will eventually reach a minimum of the function surface over its
parameter space. This is a rigorous and well-established technique for the minimization of
functions and has probably been the main factor behind the success of back-propagation.
However, this method does not guarantee convergence to the global minimum of the
error surface, as the network can be trapped in a local minimum.

Training with the backprop algorithm consists of two passes of computation: a forward
pass and a backward pass, as shown in figure 3.2. During the forward pass the weights are
fixed, whereas during the backward pass the synaptic weights are adjusted in accordance
with the error correction rule (error = desired - actual). In the forward pass an input pattern vector
is applied to the input layer nodes. The signal from the input layer propagates to the units
in the first layer and each unit produces an output according to the equation:

$$ Y_j = \frac{1}{1 + \exp(-V_j)} $$

$V_j$ = net internal activity level of neuron j

$$ Y_j = \text{output of neuron } j = f\left( \sum_{i=0}^{N} w_{ji} x_i \right) $$

The outputs of these units are propagated to the units in the subsequent layers and the
process goes on until the signals reach the output layer, where the actual response of the
network to the input vector is obtained. In the backward pass the synaptic weights are
adjusted in accordance with an error signal, which is propagated backwards through the
network against the direction of the synaptic connections.
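
The forward pass can be sketched in Python as below. The function names, the layer sizes and the weight values are hypothetical. The i = 0 term of the sum is treated here as a bias weight acting on a fixed input x_0 = 1, which is an assumed convention.

```python
import math

def logistic(v):
    """Logistic activation: Y = 1 / (1 + exp(-V))."""
    return 1.0 / (1.0 + math.exp(-v))

def forward(layers, x):
    """Propagate x through the network, returning the outputs of every layer.

    layers: list of weight matrices; each row is [w_j0, w_j1, ..., w_jN],
    where w_j0 multiplies the fixed input x_0 = 1 (assumed bias convention).
    """
    activations = [list(x)]
    for W in layers:
        inputs = [1.0] + activations[-1]
        activations.append([logistic(sum(w * xi for w, xi in zip(row, inputs)))
                            for row in W])
    return activations

# Hypothetical 2-2-1 network
layers = [
    [[0.0, 1.0, -1.0], [0.0, -1.0, 1.0]],  # hidden layer
    [[0.0, 1.0, 1.0]],                     # output layer
]
print(forward(layers, [1.0, 1.0]))
```

The weights are not changed anywhere in this pass; only the signals move from layer to layer.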



Figure 3.1 : A feed-forward network

Figure 3.2 : Flow of function and error signals (legend: function signal, error signal)
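3.1 Mathematical Analysis of Backpropagation Algorithm


Forward pass: Given an input pattern vector $x^{(p)}$, each hidden node j receives a net
input

$$ x_j^{(p)} = \sum_k w_{jk} x_k^{(p)} \tag{3.1} $$

$w_{jk}$ = weight between hidden node j and input node k.
The output of node j is given by

$$ y_j^{(p)} = f(x_j^{(p)}) = f\Big( \sum_k w_{jk} x_k^{(p)} \Big) \tag{3.2} $$

The input to each output node i is given by

$$ x_i^{(p)} = \sum_j w_{ij} y_j^{(p)} = \sum_j w_{ij} f\Big( \sum_k w_{jk} x_k^{(p)} \Big) \tag{3.3} $$

$w_{ij}$ = weight between output node i and hidden node j.

The final output from the output node i is

$$ y_i^{(p)} = f(x_i^{(p)}) = f\Big( \sum_j w_{ij} y_j^{(p)} \Big) = f\Big( \sum_j w_{ij} f\Big( \sum_k w_{jk} x_k^{(p)} \Big) \Big) \tag{3.4} $$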

Backprop algorithm can be implemented in two different modes: on Cine mode and Batch 
mode. These are discussed in brief 

Online mode : The error function is calculated after the presentation of each input pattern 
and the error signal is propagated back through the network modifying the weights before 
the next input pattern is applied. The most commonly used error function is the Mean 
Square Error (MSE) of the difference between desired and the actual responses of the 
network over all the output units. After this the new weights remain fixed and a new 
pattern is presented to the network and this process continues until all the patterns have 
been presented to the network. Presentation of the entire set of patterns is termed as one 
epoch or one iteration. 

Batch mode : The error signal is calculated for each input pattern but the weights are 
modified only when all input patterns have been presented. The error function is 
calculated as the sum of the individual MSE errors for each pattern and the weights are 
accordingly modified (all in a single step for all the patterns) before the next iteration. 
Error function in batch mode, calculated as MSE over all output units i and over all 
patterns p is given by: 

$$E = \frac{1}{2} \sum_p \sum_i \left(d_i^{(p)} - y_i^{(p)}\right)^2 = \frac{1}{2} \sum_p \sum_i \left(d_i^{(p)} - f\left(\sum_j w_{ij}\, f\left(\sum_k w_{jk}\, y_k^{(p)}\right)\right)\right)^2 \qquad (3.5)$$

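E here is a differentiable function of all the weights (and thresholds, according to the 
bias convention) and therefore we can apply the method of gradient descent. 
For the hidden-to-output connections the gradient descent rule gives 

$$\Delta w_{ij} = -\eta\, \frac{\partial E}{\partial w_{ij}} \qquad (3.6)$$

where $\eta$ is a constant that determines the rate of learning and is therefore called the 
learning rate of the backpropagation algorithm. Using the chain rule on the above equation 
we have 

$$\frac{\partial E}{\partial w_{ij}} = \sum_p \frac{\partial E}{\partial y_i^{(p)}}\, \frac{\partial y_i^{(p)}}{\partial w_{ij}} \quad \text{and} \quad \frac{\partial y_i^{(p)}}{\partial w_{ij}} = \frac{\partial y_i^{(p)}}{\partial x_i^{(p)}}\, \frac{\partial x_i^{(p)}}{\partial w_{ij}} \qquad (3.7)$$

From these two equations we have 

$$\frac{\partial E}{\partial w_{ij}} = \sum_p \frac{\partial E}{\partial y_i^{(p)}}\, \frac{\partial y_i^{(p)}}{\partial x_i^{(p)}}\, \frac{\partial x_i^{(p)}}{\partial w_{ij}} \qquad (3.8)$$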
Figure 3.3 : Multi-layer Feed-forward Network (nodes k: input layer; nodes j: hidden layer; nodes i: output layer) 


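Thus, 

$$\Delta w_{ij} = -\eta\, \frac{\partial E}{\partial w_{ij}} = \eta \sum_p \left(d_i^{(p)} - y_i^{(p)}\right) f'\left(x_i^{(p)}\right) y_j^{(p)} \qquad (3.9)$$

$$\Delta w_{ij} = \eta \sum_p \delta_i^{(p)}\, y_j^{(p)} \qquad (3.10)$$

where $\delta_i^{(p)} = \left(d_i^{(p)} - y_i^{(p)}\right) f'\left(x_i^{(p)}\right)$. 

For input-to-hidden connections the gradient descent rule gives the weights by the 
following equation: 

$$\Delta w_{jk} = -\eta\, \frac{\partial E}{\partial w_{jk}} \qquad (3.11)$$

Applying the chain rule to the above equation we get 

$$\frac{\partial E}{\partial w_{jk}} = \sum_p \frac{\partial E}{\partial y_j^{(p)}}\, \frac{\partial y_j^{(p)}}{\partial w_{jk}} \qquad (3.12)$$

and 

$$\frac{\partial y_j^{(p)}}{\partial w_{jk}} = \frac{\partial y_j^{(p)}}{\partial x_j^{(p)}}\, \frac{\partial x_j^{(p)}}{\partial w_{jk}} \qquad (3.13)$$

From both these equations we get 

$$\frac{\partial E}{\partial w_{jk}} = \sum_p \frac{\partial E}{\partial y_j^{(p)}}\, \frac{\partial y_j^{(p)}}{\partial x_j^{(p)}}\, \frac{\partial x_j^{(p)}}{\partial w_{jk}} = \sum_p \frac{\partial E}{\partial y_j^{(p)}}\, f'\left(x_j^{(p)}\right) y_k^{(p)} \qquad (3.14)$$

The term $\partial E / \partial y_j^{(p)}$ can be represented as 

$$\frac{\partial E}{\partial y_j^{(p)}} = \sum_i \frac{\partial E}{\partial y_i^{(p)}}\, \frac{\partial y_i^{(p)}}{\partial x_i^{(p)}}\, \frac{\partial x_i^{(p)}}{\partial y_j^{(p)}} = -\sum_i \left(d_i^{(p)} - y_i^{(p)}\right) f'\left(x_i^{(p)}\right) w_{ij} = -\sum_i \delta_i^{(p)}\, w_{ij} \qquad (3.15)$$

Thus, 

$$\frac{\partial E}{\partial w_{jk}} = -\sum_p \sum_i \delta_i^{(p)}\, w_{ij}\, f'\left(x_j^{(p)}\right) y_k^{(p)} \qquad (3.16)$$

Therefore 

$$\Delta w_{jk} = \eta \sum_p \delta_j^{(p)}\, y_k^{(p)} \qquad (3.17)$$

where $\delta_j^{(p)} = f'\left(x_j^{(p)}\right) \sum_i w_{ij}\, \delta_i^{(p)}$. 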
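
The batch update rules derived above can be checked numerically. The sketch below is illustrative only: it assumes a logistic activation, the error E = 1/2 * sum of squared differences, no thresholds, and a small 2-2-1 network whose weights and training patterns are arbitrary values invented for the example. It compares the weight change obtained from the delta terms with a central-difference estimate of the negative gradient.

```python
import math

def f(v):
    """Logistic activation."""
    return 1.0 / (1.0 + math.exp(-v))

def forward(w_jk, w_ij, y_k):
    y_j = [f(sum(w * y for w, y in zip(row, y_k))) for row in w_jk]
    y_i = [f(sum(w * y for w, y in zip(row, y_j))) for row in w_ij]
    return y_j, y_i

def error(w_jk, w_ij, patterns):
    """E = 1/2 * sum over patterns p and output units i of (d_i - y_i)^2."""
    total = 0.0
    for y_k, d in patterns:
        _, y_i = forward(w_jk, w_ij, y_k)
        total += 0.5 * sum((t - y) ** 2 for t, y in zip(d, y_i))
    return total

def batch_updates(w_jk, w_ij, patterns, eta):
    """Batch-mode weight changes for both layers, summed over all patterns."""
    d_w_ij = [[0.0] * len(row) for row in w_ij]
    d_w_jk = [[0.0] * len(row) for row in w_jk]
    for y_k, d in patterns:
        y_j, y_i = forward(w_jk, w_ij, y_k)
        # Output deltas (d_i - y_i) * f'(x_i); for the logistic f' = y * (1 - y).
        delta_i = [(t - y) * y * (1.0 - y) for t, y in zip(d, y_i)]
        # Hidden deltas f'(x_j) * sum_i w_ij * delta_i.
        delta_j = []
        for j, y in enumerate(y_j):
            back = sum(w_ij[i][j] * delta_i[i] for i in range(len(w_ij)))
            delta_j.append(y * (1.0 - y) * back)
        for i, di in enumerate(delta_i):
            for j, y in enumerate(y_j):
                d_w_ij[i][j] += eta * di * y
        for j, dj in enumerate(delta_j):
            for k, y in enumerate(y_k):
                d_w_jk[j][k] += eta * dj * y
    return d_w_jk, d_w_ij

# Arbitrary 2-2-1 network and two training patterns, for illustration only.
w_jk = [[0.1, -0.2], [0.4, 0.3]]
w_ij = [[0.5, -0.6]]
patterns = [([1.0, 0.0], [1.0]), ([0.0, 1.0], [0.0])]
d_w_jk, d_w_ij = batch_updates(w_jk, w_ij, patterns, eta=1.0)

# Central-difference estimate of -dE/dw for one input-to-hidden weight.
h = 1e-6
plus = [row[:] for row in w_jk]
minus = [row[:] for row in w_jk]
plus[0][1] += h
minus[0][1] -= h
numeric = -(error(plus, w_ij, patterns) - error(minus, w_ij, patterns)) / (2 * h)
```

With eta = 1 the entry `d_w_jk[0][1]` should agree with `numeric` up to the accuracy of the finite difference, which is a convenient sanity check when implementing the derivation by hand.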
From equations 3.8 and 3.15 we see that if the activation function f(.) were not differentiable 
we would be unable to implement the gradient descent rule, since it would be 
impossible to calculate the partial derivatives of E with respect to the weights. That is 
why differentiability of the activation function is so important in 
back-propagation learning. Consider the sigmoid (logistic) function 

f(u) = 1/(1 + exp(-au)) 

With a = 1 its derivative is given by 

f'(u) = f(u) [1 - f(u)] (3.18) 

(for a general gain a the right-hand side is multiplied by a). This shows that we need not 
compute f'(u) separately once we know f(u). Equation 3.18 highlights the importance of 
using the sigmoid function as an activation function: the computation time is reduced 
significantly because f'(u) is obtained directly from f(u). 
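
The derivative identity for the logistic function can be verified with a short numerical check. The sketch below is illustrative: the function names and the test point are assumptions, and it uses the general form with gain a, for which the derivative is a * f(u) * (1 - f(u)); with a = 1 this is the identity of equation 3.18.

```python
import math

def logistic(u, a=1.0):
    """f(u) = 1 / (1 + exp(-a * u))."""
    return 1.0 / (1.0 + math.exp(-a * u))

def logistic_derivative_from_output(y, a=1.0):
    """Derivative expressed through the output y = f(u): a * y * (1 - y)."""
    return a * y * (1.0 - y)

# Compare with a central-difference estimate at an arbitrary point.
u, a, h = 0.7, 2.0, 1e-6
numeric = (logistic(u + h, a) - logistic(u - h, a)) / (2 * h)
analytic = logistic_derivative_from_output(logistic(u, a), a)
```

Because the derivative is computed from the stored output alone, the backward pass needs no additional calls to the exponential function.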

On-Line Mode of training: In this, the weight updates are given by 

$$\Delta w_{ij} = \eta\, (d_i - y_i)\, f'(x_i)\, y_j = \eta\, \delta_i\, y_j \qquad (3.19)$$

$$\Delta w_{jk} = \eta\, \delta_j\, y_k \qquad (3.20)$$

The difference, as compared to batch learning, is that here there is no summation over the 
set of training patterns. The two equations are the same except for a different definition 
of $\delta$. In general, with an arbitrary number of layers, the back-propagation update rule 
always has the form: 

Weight correction ($\Delta w_{lm}$) = Learning rate ($\eta$) * Local gradient ($\delta_l$) * Input signal ($y_m$) 

This is called the generalized Delta Rule. 

3.1.1 Initialization: An important aspect of back-propagation training is the proper 
initialization of the network. Improper initialization may cause the network to require a 
very long time for training and there is a high probability that the solution eventually 
found may not be the optimum solution. The first step in the back-propagation algorithm 
is initialization of the network. A good choice of initial values of the free parameters (i.e. 
weights and thresholds) of the network can significantly accelerate learning. If all the 
weights start off with equal values and if the solution requires that unequal weights be 
developed, the system can never learn. 
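
The effect of equal initial weights can be seen in a small sketch. It is illustrative only: the function name, the 2-2-1 layout with a single output, the value 0.3 and the training pattern are assumptions. When all weights are equal, both hidden units compute the same output and receive the same error signal, so their weight changes are identical and the units can never become different from each other.

```python
import math

def f(v):
    return 1.0 / (1.0 + math.exp(-v))

def hidden_updates(w_jk, w_ij, y_k, d, eta=0.5):
    """On-line weight changes for the input-to-hidden weights of a one-output network."""
    y_j = [f(sum(w * y for w, y in zip(row, y_k))) for row in w_jk]
    y_i = f(sum(w * y for w, y in zip(w_ij, y_j)))
    delta_i = (d - y_i) * y_i * (1.0 - y_i)
    delta_j = [y * (1.0 - y) * w * delta_i for y, w in zip(y_j, w_ij)]
    return [[eta * dj * y for y in y_k] for dj in delta_j]

# All weights start at the same (arbitrary) value 0.3.
updates = hidden_updates([[0.3, 0.3], [0.3, 0.3]], [0.3, 0.3], y_k=[1.0, 0.0], d=1.0)
```

Here `updates[0]` and `updates[1]` are the changes for the two hidden units; they come out identical, which is why small random initial weights are normally used to break the symmetry.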

3.1.2 Learning Rate : Another important parameter is the learning rate. It determines 
what amount of the calculated error sensitivity to weight change will be used for weight 
correction. The “best” value of learning rate depends on the characteristics of the error 
surface, i.e., a plot of E versus $w_{ij}$. If the surface changes rapidly, the gradient calculated 
only on local information will give a poor indication of the true “right path.” In such a case 
a smaller learning rate is desirable. On the other hand, if the surface is relatively smooth, a 
large learning rate will speed up convergence. This logic, however, requires prior 
knowledge of the error surface, which is rarely available. A general rule is to use the 
largest learning rate that works and does not cause oscillation. A rate too large may cause 
the system to oscillate and thereby slow or prevent the network’s convergence. 
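
The influence of the learning rate can be illustrated on the simplest possible error surface. The sketch below is an assumption made for illustration (a one-weight quadratic error, not one of the problems studied here): for E(w) = curvature * w^2 / 2 each step multiplies w by (1 - eta * curvature), so the iteration converges when this factor has magnitude below 1 and oscillates with growing amplitude when eta * curvature exceeds 2.

```python
def descend(eta, curvature=1.0, w0=1.0, steps=20):
    """Gradient descent on E(w) = curvature * w^2 / 2, whose gradient is curvature * w."""
    w = w0
    path = [w]
    for _ in range(steps):
        w = w - eta * curvature * w
        path.append(w)
    return path

slow = descend(eta=0.1)      # steady but slow approach to the minimum at w = 0
fast = descend(eta=0.9)      # much faster convergence
unstable = descend(eta=2.1)  # overshoots more on every step and diverges
```

The sign of `unstable` alternates from step to step, which is the oscillation mentioned above.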

3.1.3 Momentum Method: The purpose of the momentum method is to accelerate the 
convergence of the error back-propagation learning algorithm. In this method the weight 
update is supplemented with a fraction of the most recent weight adjustment. This is a 
very simple way of increasing the effective learning rate while avoiding the danger of 
instability. The weight update is done according to the formula: 

$$\Delta w_{ij}(n) = -\eta\, \frac{\partial E}{\partial w_{ij}} + \alpha\, \Delta w_{ij}(n-1) \qquad (3.21)$$

where the arguments n and n-1 are used to indicate the current and the most recent 
training steps, respectively, and $\alpha$ is a user-selected positive momentum constant. 
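
The momentum update can be written as a one-line function. The sketch below is illustrative: the function name and the values eta = 0.1 and alpha = 0.9 are assumptions. With a constant gradient the accumulated step approaches -eta * grad / (1 - alpha), which shows how momentum raises the effective learning rate along a consistent downhill direction.

```python
def momentum_step(grad, prev_delta, eta=0.1, alpha=0.9):
    """Weight change: -eta * dE/dw + alpha * (previous weight change)."""
    return -eta * grad + alpha * prev_delta

# Repeated steps with a constant gradient of 1.0.
delta = 0.0
for _ in range(200):
    delta = momentum_step(grad=1.0, prev_delta=delta)
```

With these values the limiting step is -1.0, ten times the plain gradient descent step of -0.1.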


3.2 Activation Functions 


Neural network theory shows that backprop networks can represent most reasonable 
functions as closely as you like with linear output units and a single layer of 
non-polynomial hidden layer units. There are however many activation functions that you can 
choose from and each one has its own special virtues. 

The original network activation function was the linear activation function: 

y = D*x (3.22) 

where x is the input to the neuron, y is the final value of the neuron and usually D = 1. 
This is also called identity mapping. The figure below shows some linear activation 
functions. 



Figure 3.4 : Linear Activation Functions 

Linear functions are inadequate to approximate most of the functions, and therefore, 
some non-linear functions are needed. 

• The standard sigmoid (or logistic) runs from 0 to 1 and it is: 

y = 1/(1 + exp(-D*x)) (3.23) 

where the input to the neuron is x and most often D = 1. The derivative is: y * (1 - y). 
There is some theory and a few experiments that show that hidden layer unit activation 
values centered around 0 will speed up training, so in some cases people subtract 0.5 from 
the above. The standard sigmoid can be approximated using the function 

x | f(x) 
x >= 1 | 1 
-1 < x < 1 | 0.5 + x * (1 - abs(x) / 2) (3.24) 
x <= -1 | 0 

if you use x = input / 4.1. The maximal absolute deviation between this function and the 
standard sigmoid is less than 0.02. Figure 3.5 shows some of the logistic activation 
functions used: 

Figure 3.5 : Logistic Activation Functions 
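
The piecewise approximation of equation 3.24 is easy to compare against the standard sigmoid. The sketch below is illustrative: the function names and the evaluation grid (inputs from -10 to 10 in steps of 0.01) are assumptions, so the measured deviation is only a grid estimate of the maximal deviation quoted in the text.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def approx_sigmoid(v):
    """Piecewise-quadratic approximation with the input rescaled by 4.1."""
    x = v / 4.1
    if x >= 1.0:
        return 1.0
    if x <= -1.0:
        return 0.0
    return 0.5 + x * (1.0 - abs(x) / 2.0)

# Largest deviation on a grid of inputs from -10 to 10 in steps of 0.01.
grid = [i / 100.0 for i in range(-1000, 1001)]
max_dev = max(abs(approx_sigmoid(v) - sigmoid(v)) for v in grid)
```

The approximation avoids the exponential altogether, which is where the saving in computation comes from.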

• Another popular function is tanh; it has outputs in the range -1 to 1 and it can be written 
as: 

y = 2 / (1 + exp(-2*x)) - 1 (3.25) 

The derivative is: 1 - y * y. Because its values are centered around 0 there is some reason to 
believe that using tanh will result in faster training. Experiments show that sometimes 
tanh is better but sometimes it is not. 

• One can also approximate sigmoid functions with a series of straight lines, i.e. a 
piecewise linear function. These may require more iterations to solve the problem, but even 
so they will save CPU time. 

• Sometimes the Gaussian function (as shown in figure 3.6): 

y = exp(-x * x) (3.26) 

is used and in some cases it can produce faster training; for instance, in the 2-1-1 XOR 
network with direct input to output connections you can get faster training. The Gaussian 
also improves the performance of the 10-5-10 ENCODER problem. The derivative is: 
-2*x*y, where x is the input and y is the value of the neuron. 

Figure 3.6 : Gaussian Activation Function 

• The following sigmoid runs from -1 to 1 and is also faster to compute: 

y = x/(1 + |x|) (3.27) 

The derivative is: 1/((1 + |x|)*(1 + |x|)) 

• This sigmoid runs from 0 to 1 and is also faster to compute: 

y = (x/2)/(1 + |x|) + 0.5 (3.28) 

Its derivative is given by: 1/(2*(1 + |x|)*(1 + |x|)) 

The last two sigmoids approach their extremes more slowly. This means that if you are 
trying to output numerical values it will take more iterations to reach your target value. 
But if you are doing a classification problem you only care that the correct output 
value is greater than the other outputs, and here these functions will save time without 
influencing the number of iterations required by very much. 
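
The two fast sigmoids and their derivatives can be written without any call to the exponential function. The sketch below is illustrative (the function names and the test point x = -1.5 are assumptions) and compares each stated derivative with a central-difference estimate.

```python
def fast_sigmoid(x):
    """Runs from -1 to 1: y = x / (1 + |x|)."""
    return x / (1.0 + abs(x))

def fast_sigmoid_derivative(x):
    return 1.0 / ((1.0 + abs(x)) * (1.0 + abs(x)))

def fast_sigmoid01(x):
    """Runs from 0 to 1: y = (x / 2) / (1 + |x|) + 0.5."""
    return (x / 2.0) / (1.0 + abs(x)) + 0.5

def fast_sigmoid01_derivative(x):
    return 1.0 / (2.0 * (1.0 + abs(x)) * (1.0 + abs(x)))

# Central-difference estimates of both derivatives at an arbitrary point.
h = 1e-6
x = -1.5
numeric = (fast_sigmoid(x + h) - fast_sigmoid(x - h)) / (2 * h)
numeric01 = (fast_sigmoid01(x + h) - fast_sigmoid01(x - h)) / (2 * h)
```

Both functions need only one absolute value, one addition and one division per evaluation.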

Theory says that backprop can approximate most normal functions if and only if the 
hidden layer unit is non-polynomial; however, the following function can also work: 

y = sgn(x) * x * x (3.29) 

and it has the virtues of running from minus infinity to plus infinity and being fast to 
compute. The derivative is sgn(x)*2*x. However, quite often the calculations will go 
wild unless very small learning rates are used or, better still, an acceleration algorithm 
is used that will automatically control the learning rates. 

3.3 Faster Training 


Plain back-propagation is terribly slow and everyone wants it to go faster. There are a 
series of things that can be done to speed up learning: 

• Fudge the derivative term 

• Scale the Data 

• Direct Input-Output Connections 

• Vary the Sharpness (Gain) of the Activation Function 

• Use a different Activation Function 

• Use better Algorithms 

3.3.1 Fudge the Derivative Term: The first major improvement to backprop is 
extremely simple: you can fudge the derivative term in the output layer. If using the usual 
backprop activation function: 

1 / (1 + exp(-D*x)) 
the derivative is 

s * (1 - s) 

where s is the activation value of the output unit and most often D = 1. The derivative is 
largest at s = 1/2 and it is here that you will get the largest weight changes. Unfortunately, 
as 0 or 1 is approached, the derivative term gets close to 0 and the weight changes become 
very small. In fact if the network’s response is 1 and the target is 0, that is, the network is 
off by quite a lot, you end up with very small weight changes. It can take a VERY long 
time for the training process to correct this. More than likely one gets tired of waiting. 
Fahlman’s solution was to add 0.1 to the derivative term, making it: 

0.1 + s*(1 - s) 

The solution of Chen and Mars was to drop the derivative term altogether, so that in effect 
the derivative was 1. This method passes back much larger error quotas to the lower layer, so 
large that a small eta must be used there. In their experiments on the 10-5-10 encoder 
problem they found the best results came when that eta was 0.1 times the output level eta, 
hence they called their method the “differential step size” method. One tenth is not 
always the best value, so one must experiment with both the upper and the lower level 
etas to get the best results. Besides that, the eta used for the upper layer must also be 
much smaller than the eta used without this method. 
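
The effect of adding 0.1 to the derivative term can be seen on a single saturated output unit. The sketch below is illustrative: the function names and the values s = 0.999 and target = 0 are assumptions chosen to mimic the "response is 1, target is 0" situation described above.

```python
def sigmoid_prime(s):
    """Plain derivative term for a logistic output s (with D = 1)."""
    return s * (1.0 - s)

def fahlman_prime(s, offset=0.1):
    """Derivative term with a constant offset added, as described in the text."""
    return offset + s * (1.0 - s)

# Output stuck near 1 while the target is 0: compare the output-layer error signal.
s, target = 0.999, 0.0
plain_delta = (target - s) * sigmoid_prime(s)
fudged_delta = (target - s) * fahlman_prime(s)
```

With the plain derivative the error signal is about a thousandth of the actual error, while with the offset it stays near a tenth of it, so the weights can still move.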

3.3.2 Direct Input-Output Connections: Adding direct connections from the input 
layer to the output layer can often speed up training. It is supposed to work best when the 
function to be approximated is almost linear and it only needs a small amount of 
adjustment from nonlinear hidden layer units. This method can also cut down on the 
number of hidden layer units you need. It is not recommended when there are a large 
number of output units because then you add more free parameters to the net and possibly 
hurt generalization. 

3.3.3 Adjusting the Sharpness/Gain: The training time can be decreased by 
increasing the sharpness or gain (D) in the standard sigmoid: 

1/ (1 + exp(-D*x)) 

It has been reported that the training time goes as 1/D for training without momentum and 
1/sqrt(D) for networks with momentum. This is not a perfect speed-up scheme, since 
when D is too large you run the risk of becoming trapped in a local minimum. Sometimes 
the best value for D is less than 1. This can be used in combination with all other training 
algorithms. 

3.3.4 Better Algorithms: It is desirable to have faster training and there are many 
variations on backprop that will speed up training times enormously. However, 
experience shows that the plain online update method, although it may be pitifully slow, 
will sometimes give better results than these faster algorithms. In most cases the 
accelerating algorithms work so much faster than either online or batch backprop that one 
should first try the faster methods and then, if better results are desired, try either the 
slower online update or the batch method. There is a whole collection of algorithms in 
which a different learning rate (eta) is assigned to each weight. As the training proceeds 
eta is increased in some way as long as the error is reducing (going downhill). 
When the weight change becomes too large the network overshoots and the error goes 
uphill, and the learning rate must then be reduced. Algorithms vary in both the speeding-up 
and the slowing-down details. Second, there is a set of algorithms known as conjugate 
gradient methods. Third, there are methods that build up the network one hidden unit at a 
time. Most of these methods involve extra calculations to determine the value of the 
learning rate and momentum term. Some of the common algorithms are: 

• The Rprop Algorithm 

• The Quickprop Algorithm 

• The SuperSAB Algorithm 

• Conjugate Gradient 

• Cascade Correlation 

• Variable Learning Rate Algorithms (GRBH, IG) 

The Rprop Algorithm : Multi-layer networks typically use sigmoid transfer functions in 
their hidden layers. These functions are called squashing functions, since they compress 
an infinite input range into a finite output range. Their slope approaches zero as the input 
gets large. This causes a problem when using steepest descent to train a multi-layer 
network with sigmoid functions, since the gradient can have a very small magnitude, and 
therefore cause small changes in the weights and the biases, even though the weights and 
the biases are far from their optimal values. The purpose of the resilient backprop (Rprop) 
training algorithm is to eliminate these harmful effects of the magnitudes of the partial 
derivatives. Only the sign of the derivative is used to determine the direction of the weight 
update; the magnitude of the derivative has no effect on the weight update. The size of the 
weight update is determined by a separate update value. The update value for each weight 
and bias is increased by a factor delt_inc whenever the derivative of the performance 
function with respect to that weight has the same sign for two successive iterations. The 
update value is decreased by a factor delt_dec whenever the sign changes with respect to 
the previous iteration. If the derivative is zero, then the update value remains the same. 
Whenever the weights are oscillating the weight change will be reduced. If the weight 
changes in the same direction for several iterations, then the magnitude of the weight 
change will be increased. 
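
The sign-based update described above can be sketched for a single weight. This is a simplified illustration and not the toolbox implementation: the function names, the bounds `step_min` and `step_max`, the factors 1.2 and 0.5 for delt_inc and delt_dec, and the quadratic test error are all assumptions made for the example.

```python
def sign(x):
    return (x > 0) - (x < 0)

def rprop_step(w, grad, prev_grad, step, delt_inc=1.2, delt_dec=0.5,
               step_min=1e-6, step_max=50.0):
    """One simplified Rprop update for a single weight.

    Only the sign of grad is used; `step` is the separate update value.
    """
    if grad * prev_grad > 0:
        step = min(step * delt_inc, step_max)   # same sign twice: grow the step
    elif grad * prev_grad < 0:
        step = max(step * delt_dec, step_min)   # sign change: shrink the step
    w = w - sign(grad) * step
    return w, step

# Minimise E(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, step, prev_grad = 0.0, 0.1, 0.0
for _ in range(100):
    grad = 2.0 * (w - 3.0)
    w, step = rprop_step(w, grad, prev_grad, step)
    prev_grad = grad
```

The weight first approaches the minimum with growing steps, overshoots, and then closes in as each sign change shrinks the update value.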
Quickprop: This is also one of the better training algorithms and is loosely based on 
Newton’s method for finding the minimum of a quadratic function. Standard backprop 
calculates the weight change based on the first derivative of the error with respect to the 
weight. If the second derivative is also available then a better step size and 
direction can be found. The Quickprop modification is an attempt to estimate and utilize 
second derivative information. Quickprop requires saving the previous gradient vector as 
well as the previous weight change. The calculation of the weight change uses only 
information associated with the weight being updated: 

$$\Delta w_{ij}(n) = \frac{\nabla w_{ij}(n)}{\nabla w_{ij}(n-1) - \nabla w_{ij}(n)}\, \Delta w_{ij}(n-1) \qquad (3.30)$$

where, 

$\nabla w_{ij}(n)$ = the gradient vector component associated with weight $w_{ij}$ in step n, 
$\nabla w_{ij}(n-1)$ = the gradient vector component associated with weight $w_{ij}$ in the previous step, 
$\Delta w_{ij}(n-1)$ = the weight change in the (n-1)th step. 

A maximum growth factor $\mu$ is used to limit the rate of increase of the step size: 

If $\Delta w_{ij}(n) > \mu\, \Delta w_{ij}(n-1)$ 

Then $\Delta w_{ij}(n) = \mu\, \Delta w_{ij}(n-1)$ 

Fahlman suggested an empirical value of 1.75 for $\mu$. 

There are some complications in this method. First, the step size calculation requires a 
previous value, which is not available at the start. This is overcome by using the standard 
back-propagation method for the first weight adjustment. The gradient descent weight 
change is given by 

$$w_{ij}(n+1) = w_{ij}(n) - \eta\, \nabla w_{ij}(n) \qquad (3.31)$$

The value of $\eta$ is taken suitably small. 


The second problem is that the weight values are unbounded. They can become so large 
that they cause overflow in the computer. 
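
The Quickprop step for a single weight can be sketched as follows. This is an illustration under stated assumptions rather than Fahlman's full algorithm: the function name is hypothetical, the growth limit is applied to the magnitude of the step, and a plain gradient descent step with an assumed eta = 0.1 is used whenever no usable history exists.

```python
def quickprop_step(grad, prev_grad, prev_delta, eta=0.1, mu=1.75):
    """Weight change for one weight from the current and previous gradient."""
    if prev_delta == 0.0 or prev_grad == grad:
        # No usable history: fall back to plain gradient descent.
        return -eta * grad
    delta = grad / (prev_grad - grad) * prev_delta
    # Limit growth to mu times the magnitude of the previous step.
    if abs(delta) > mu * abs(prev_delta):
        delta = mu * abs(prev_delta) * (1 if delta > 0 else -1)
    return delta

# For a quadratic error E(w) = (w - 3)^2 the step lands on the minimum
# when the growth limit does not intervene.
w0, w1 = 0.0, 2.0
g0, g1 = 2.0 * (w0 - 3.0), 2.0 * (w1 - 3.0)
delta = quickprop_step(g1, g0, w1 - w0)
w2 = w1 + delta

# Nearly equal gradients would give a huge step, so the limit applies.
limited = quickprop_step(-5.9, -6.0, 1.0)
```

The first call shows why the method is fast on nearly quadratic error surfaces; the second shows the role of the maximum growth factor.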
Supersab: The Self-Adjusting Back-Propagation algorithm (SuperSAB) was developed 
by Tom Tollenaere. It is not as fast as Quickprop or Rprop on the standard problems; 
however, at this stage it is too early to pass a verdict on this algorithm. 

Cascade Correlation : Cascade Correlation by Fahlman and Lebiere (School of Computer 
Science, Pittsburgh) may be the fastest available training method and it constructs the 
network by adding hidden units one at a time. One of the reasons it is fast is that only the 
new weights are trained; the rest of the weights stay fixed. It does not work very well 
with function approximation problems. It begins with a minimal network and then 
automatically trains and adds new hidden units one by one, creating a multi-layer 
structure. Once a new hidden unit has been added to the network, its input-side weights 
are frozen. This unit then becomes a permanent feature detector in the network, available 
for producing outputs or for creating other, more complex feature detectors. This 
architecture learns very quickly, the network determines its own size and topology, it 
retains the structure it has built even if the training set changes, and it requires no 
back-propagation of error signals through the connections of the network. 

Conjugate Gradient Methods : The basic back-propagation algorithm adjusts the weights 
in the steepest descent direction (negative of the gradient). This is the direction in which 
the performance function is decreasing most rapidly. It turns out that, although the 
function decreases most rapidly along the negative of the gradient, this does not 
necessarily produce the fastest convergence. In the conjugate gradient algorithms a search 
is performed along conjugate directions, which generally produces faster convergence than 
steepest descent directions. In most of the training algorithms the learning rate is used to 
determine the length of the weight update (step size). In most of the conjugate gradient 
algorithms the step size is adjusted at each iteration. A search is made along the conjugate 
gradient direction to determine the step size which will minimize the performance 
function along that line. There are various search functions, and any of these can be used 
interchangeably with a variety of training functions. The conjugate gradient algorithms 
implemented in the MATLAB toolbox are: Traincgf, Traincgp, Traincgb, Trainscg. 

All of the conjugate gradient algorithms start out by searching in the steepest descent 
direction (negative of the gradient) on the first iteration. 

$$p_0 = -g_0$$

A line search is then performed to determine the optimal distance to move along the 
current search direction. 

$$w_{k+1} = w_k + \alpha_k\, p_k$$

where $\alpha_k$ is the learning rate (step size). 

Then the next search direction is determined such that it is conjugate to previous search 
directions. The general procedure to determine the new search direction is to combine the 
new steepest descent direction with the previous search direction. 

$$p_k = -g_k + \beta_k\, p_{k-1}$$

The various versions of conjugate gradient are distinguished by the manner in which the 
constant $\beta_k$ is computed. 
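
The three steps above can be sketched on a small quadratic error surface, where the line search has a closed form. This is an illustration under stated assumptions and not the toolbox code: the error is taken as E(w) = 0.5 * w.A.w - b.w with an arbitrary 2x2 matrix A, the Fletcher-Reeves choice is used for the constant that combines the directions, and all function names are hypothetical.

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(A, b, w, iterations):
    """Minimise E(w) = 0.5 * w.A.w - b.w using Fletcher-Reeves directions."""
    g = [gi - bi for gi, bi in zip(matvec(A, w), b)]    # gradient A w - b
    p = [-gi for gi in g]                                # first direction: -g_0
    for _ in range(iterations):
        Ap = matvec(A, p)
        alpha = -dot(g, p) / dot(p, Ap)                  # exact line search for a quadratic
        w = [wi + alpha * pi for wi, pi in zip(w, p)]    # move along the search direction
        g_new = [gi + alpha * api for gi, api in zip(g, Ap)]
        beta = dot(g_new, g_new) / dot(g, g)             # Fletcher-Reeves constant
        p = [-gn + beta * pi for gn, pi in zip(g_new, p)]
        g = g_new
    return w

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
w = conjugate_gradient(A, b, [2.0, 1.0], iterations=2)
```

For a quadratic in two weights, two conjugate directions are enough to reach the minimum, which here is the solution of A w = b. For a neural network error surface the line search has to be done numerically and more iterations are needed.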

Variable Learning Rate Algorithms : With standard steepest descent, the learning rate is 
held constant throughout training. The performance of the algorithm is very sensitive to 
the proper setting of the learning rate. If the learning rate is set too high, the algorithm 
may oscillate and become unstable. If the learning rate is too small, the algorithm will 
take too long to converge. It is not practical to determine the optimal setting for the 
learning rate before training, and, in fact, the optimal learning rate changes during the 
training process, as the algorithm moves across the performance surface. The 
performance of the steepest descent algorithm can be improved if we allow the learning 
rate to change during the training process. An adaptive learning rate will attempt to keep 
the learning step size as large as possible while keeping learning stable. The learning rate 
is made responsive to the complexity of the local error surface. First, the initial network 
output and error are calculated. At each epoch new weights and biases are calculated 
using the current learning rate. New outputs and errors are then calculated. If the new 
error exceeds the old error by more than a predefined ratio, the new weights are discarded 
and the learning rate is decreased. If the new error is less than the old error then the 
learning rate is increased. This procedure increases the learning rate, but only to the 
extent that the network can learn without large error increases. TRAINGDA and 
TRAINGDX are the two variable learning rate algorithms employed in the MATLAB 
Toolbox. Some examples of variable learning rate algorithms are discussed in the next 
two sub-sections. 
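
The accept/reject rule described above can be sketched for a single weight. This is an illustration and not the toolbox code: the function name, the quadratic test error and the values chosen for `lr_inc`, `lr_dec` and `max_perf_inc` are assumptions made for the example.

```python
def adaptive_descent(grad_fn, err_fn, w, eta=0.01, epochs=50,
                     lr_inc=1.05, lr_dec=0.7, max_perf_inc=1.04):
    """Gradient descent on one weight with an adaptive learning rate."""
    err = err_fn(w)
    for _ in range(epochs):
        w_new = w - eta * grad_fn(w)
        err_new = err_fn(w_new)
        if err_new > err * max_perf_inc:
            eta *= lr_dec              # discard the new weight and slow down
            continue
        if err_new < err:
            eta *= lr_inc              # the step helped: speed up
        w, err = w_new, err_new
    return w, eta

# Minimise E(w) = (w - 3)^2 starting from w = 0 with a deliberately small rate.
w, eta = adaptive_descent(lambda w: 2.0 * (w - 3.0), lambda w: (w - 3.0) ** 2, w=0.0)
```

Starting from a cautious learning rate, the rate grows for as long as the error keeps falling, so the weight approaches the minimum much faster than it would with the initial rate held fixed.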

Gradient Range Based Heuristic (GRBH) Method : This method uses a few sets of 
learning rates instead of one fixed value, as shown in figure 3.7. The gradient values are 
divided up into a number of groups according to their values. A different learning rate is 
assigned to each group. Large learning rates are assigned to groups that have small 
modulus values of the gradient; small learning rates are assigned to those with large 
modulus values of the gradient. 

Figure 3.7 : GRBH Method (x-axis: modulus of gradient, |∂E/∂w|; y-axis: learning rate, η) 

The values of the learning rate for each group are chosen at the beginning of the training 
procedure and kept constant. During the weight update procedure, the learning rate for 
each connection weight is found by determining to which group the gradient belongs. The 
gradient of each weight changes with every iteration, hence the learning rate changes 
with every iteration. Also, different weights have different gradients and therefore each 
weight will have a different value of learning rate. 
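
The group lookup can be sketched as a small function. The group boundaries and learning rates below are hypothetical values chosen only to illustrate the idea; they are not the values used in this work.

```python
def grbh_learning_rate(grad, thresholds=(0.01, 0.1, 1.0), rates=(0.8, 0.4, 0.1, 0.02)):
    """Pick a learning rate from the group that |grad| falls into.

    Small gradient moduli map to large rates and large moduli to small rates.
    The thresholds and rates are illustrative assumptions.
    """
    magnitude = abs(grad)
    for limit, rate in zip(thresholds, rates):
        if magnitude < limit:
            return rate
    return rates[-1]

flat_region = grbh_learning_rate(0.005)   # small gradient: large rate
steep_region = grbh_learning_rate(5.0)    # large gradient: small rate
```

Because the function is called with the current gradient of each individual weight, every weight obtains its own learning rate at every iteration.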


Inverse Gradient Method : In this method the learning rate is calculated as a fraction of 
the inverse of the gradient. The learning rate is given as: 

$$\eta = \frac{k}{\left|\partial E / \partial w\right|} \qquad (3.32)$$

where k is a scaling coefficient. Therefore the weight change is calculated as follows: 

$$w_{ij}(n+1) = w_{ij}(n) - \eta\, \frac{\partial E}{\partial w_{ij}} + \beta\, \Delta w_{ij}(n) \qquad (3.33)$$

where $\beta$ is the momentum term. The gradient term carries a minus sign so that the step 
goes downhill, in agreement with equation 3.6. 

Application of the GRBH and IG algorithms to the XOR and alphabet recognition problems 
showed that they perform better than standard back-propagation. 


Delta-Bar-Delta algorithm : The Delta-Bar-Delta is a method that implements four 
heuristics regarding gradient descent. It was developed by Jacobs (1988). The method 
consists of a weight update rule and a learning rate update rule. The weight update rule is 
applied to each weight $w_{ij}(n)$ at iteration n through the relationship given by 

$$w_{ij}(n+1) = w_{ij}(n) - \eta_{ij}(n+1)\, \frac{\partial E(n)}{\partial w_{ij}(n)} \qquad (3.34)$$

where $\eta_{ij}(n)$ is the learning rate for the weight $w_{ij}(n)$ at update iteration n. 
The learning rate update rule for a given weight $w_{ij}(n)$ is defined as 

$$\Delta \eta_{ij}(n) = \begin{cases} \kappa & \text{if } \bar{\delta}_{ij}(n-1)\, \delta_{ij}(n) > 0 \\ -\phi\, \eta_{ij}(n) & \text{if } \bar{\delta}_{ij}(n-1)\, \delta_{ij}(n) < 0 \\ 0 & \text{otherwise} \end{cases} \qquad (3.35)$$

where 

$$\delta_{ij}(n) = \frac{\partial E(n)}{\partial w_{ij}(n)} \qquad (3.36)$$

is the partial derivative of the error with respect to $w_{ij}(n)$ at iteration n, and 

$$\bar{\delta}_{ij}(n) = (1 - \theta)\, \delta_{ij}(n) + \theta\, \bar{\delta}_{ij}(n-1) \qquad (3.37)$$

where $\kappa$ and $\phi$ are constants used to increment and decrement the learning rate 
respectively, and $0 < \theta < 1$ is an exponential “smoothing” base constant for the nth 
iteration. 

The heuristics implemented are as follows: 

1 . Every parameter (weight) has its own individual learning rate. 

2. Every learning rate is allowed to vary over time to adjust to changes in the error 
surface. 

3. When the error derivative for a weight has the same sign for several consecutive 
update steps, the learning rate for that weight should be increased. This is because 
the error surface has a small curvature at such points and will continue to slope at 
the same rate for some distance. Therefore, the step-size should be increased to 
speed up the downhill movement. 

4. When the sign of the derivative of a weight alternates for several consecutive steps, 
the learning rate for that parameter should be decreased. This is because the error 
surface has a high curvature at that point and the slope may quickly change sign. 
Thus, to prevent oscillation, the value of the step-size should be adjusted 
downward. 
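
The Delta-Bar-Delta rules and the heuristics listed above can be sketched for a single weight. This is an illustration under stated assumptions: the function name, the constants kappa = 0.01, phi = 0.1 and theta = 0.7, the starting learning rate and the quadratic test error are all values invented for the example.

```python
def delta_bar_delta_step(w, eta, grad, bar_prev, kappa=0.01, phi=0.1, theta=0.7):
    """One update of a single weight and of its own learning rate."""
    if bar_prev * grad > 0:
        eta = eta + kappa            # consistent sign: additive increase
    elif bar_prev * grad < 0:
        eta = eta - phi * eta        # sign flip: proportional decrease
    w = w - eta * grad               # weight update with the new learning rate
    bar = (1.0 - theta) * grad + theta * bar_prev   # smoothed gradient
    return w, eta, bar

# Minimise E(w) = (w - 3)^2 starting from w = 0.
w, eta, bar = 0.0, 0.05, 0.0
for _ in range(100):
    grad = 2.0 * (w - 3.0)
    w, eta, bar = delta_bar_delta_step(w, eta, grad, bar)
```

The learning rate grows by a constant while the gradient keeps its sign and is cut back by a fixed proportion when the sign flips, which keeps the step size from growing without bound.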

3.4 Improving Results 

Faster training is great but ultimately you want the best possible results on a test set. This 
section discusses the problems encountered and the techniques involved in achieving 
better results. The topics covered are: 

• Over-fitting 

• Bad Generalization 

• The Size of the Network 

• Combining Network Outputs 

• Weight Decay 

• Acceleration Algorithms 

• Weight Pruning 

• Pruning Hidden Layer Units 

• Extracting Rules 

3.4.1 Over-fitting: Backprop can fit any function even if the form of the function to 
be fitted is not known. This leads to over-fitting: as the training goes on the network will 
end up fitting the training set data very closely while ruining the results on the test set. 
Figure 3.8 illustrates the problem of trying to fit the line y = x + 1 (Donald R. Tveter, 
Pattern Recognition Basis of Artificial Intelligence). The points marked with asterisks are 
training set points set just above and below the exact line; this is typical of real-world 
measurements, where the deviation from the ideal is said to be the result of noise in the 
data. The best fit came at 2700 iterations and then the over-fitting began. Quite often 
people who want to fit a function want to extrapolate beyond the range of the training 
data, and this could be disastrous. As seen in the figure, beyond x = 1 the slope has gone 
down to nearly 0 instead of being close to 1. Even at 2700 iterations the network has 
placed its line very close to the left and right data points, thus even results near the edge 
of the range of the training data are often poor. 

3.4.2 Bad Generalization: In a pattern classification problem there is no guarantee 
that the backprop network is going to come up with a sensible way to partition the 
boundaries between the classes. The following is an example considered by D. Tveter. 
Figure 3.9(a) shows two linearly separable classes with four points each, marked A and B; 
a straight line separates them. Figure 3.9(b) shows linearly non-separable classes (created 
by adding two points in each class), together with the curve required to separate them. 
Figure 3.9 also shows the boundaries created when backprop was used. In most of the 
cases the training resulted in networks like figure 3.9(c). Sometimes it came up with 
stranger solutions like 3.9(d) and 3.9(e). The lower portion of figure 3.9(e) has been 
classified as class A even though no points belonging to A are present there. Such 
odd generalization can be minimized by averaging results over a number of networks. 

3.4.3 The Size of the Network: It is widely stated that you will get the best results on the 
test set with a relatively small number of hidden neurons. In fact one rule is that you need 
at least as many training set patterns as there are weights in the network. On the other hand 
there are reports that in certain cases networks with many more weights than patterns also 
work well. 

3.4.4 Combining Network Outputs: One of the techniques used to improve results is 
to combine the estimators (network outputs) of many networks, the equivalent of getting the 
opinion of many experts. Suppose the solution to a problem is a straight line. A network 
is unlikely to find that straight line; it will find a curvy line near the straight line. 
Another network will probably find a different curvy line near the line. By averaging a 
number of such results the curves may tend to cancel out and the average may give a straight line. 
Further there have been experiments that show that giving the results from one network a 
higher weight than another network also helps. One advantage of these methods is that 
they give good results even with rather poorly trained networks. 

3.4.5 Weight Decay: If the form of the function being fitted is known then it is 
better if some standard statistical routine is used on it. For example, if the function is linear 
one can do a simple least squares analysis and find the answer faster than by using 
backprop. With backprop the network does not know the form of the function and it may 
overfit the function. This can be avoided by using the smallest possible number of hidden 
layer units. Another way is to use weight decay, a method that tries to keep the weights 
small. The idea is to subtract a small fraction from each weight at every pass through the 
network. If the weight is w and the tiny fraction is lambda * w, then use w = w - lambda 
* w. 
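
The decay rule is a one-line operation on the weight list. The sketch below is illustrative: the function name, the value lambda = 0.001 and the starting weights are assumptions, and the decay is applied on its own, whereas in training it would be combined with the usual backprop weight change at each pass.

```python
def decay(weights, lam=0.001):
    """Shrink every weight by the fraction lam: w = w - lam * w."""
    return [w - lam * w for w in weights]

# Effect of 1000 passes of decay alone on a few arbitrary weights.
weights = [2.0, -0.5, 0.0]
for _ in range(1000):
    weights = decay(weights)
```

After 1000 passes each weight has been multiplied by 0.999 to the power 1000, roughly 0.37, so weights that the error gradient does not keep pushing up fade towards zero.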

3.4.6 Weight Pruning: Another method to improve results is to prune away excess 
weights. The general architecture of a neural network model contains a large number of 
artificial neurons and weight connections for carrying out various operations. The exact 
number of hidden nodes or connections required is very difficult to formulate. 
Researchers have indicated that too few neurons result in inappropriate modeling 
of the input-output behavior. On the other hand, a large number of neurons results in 
increased hardware implementation cost and complexity in manufacturing. Similarly, 
multi-input systems require a large number of inputs, some of which contribute little 
to the input-output behavior of the system. The weight pruning method suggests 
starting from a large network and deleting the weights that have less importance in the 
network. The three algorithms available for this purpose are: 

• Penalty function methods 

• Sensitivity analysis method 

• Iterative pruning method 
Figure 3.8 : Plots showing Over-fitting (for y = x + 1)

Figure 3.9 : Bad Generalization
Chapter 4


Learning

One of the most significant attributes of a neural network is its ability to learn from its
surroundings, and thus improve its performance. Learning takes place by an adaptive
process, known as a learning rule or algorithm, whereby the weights of the network are
adjusted so as to improve the performance. The learning process can be viewed as an
optimization process, or a search in the weight space for a solution. The learning rules
can be classified into supervised, unsupervised or reinforcement learning. Various
learning algorithms from each of these categories have been implemented in the design of
neural networks. A prescribed set of well defined rules for the solution of a learning
problem is called a Learning Algorithm. Basically the learning rules differ from each other
in the manner in which they adjust the weights of the synapses.

4.1 Various Learning Paradigms : The three learning paradigms are as follows: 

i) Supervised learning 

ii) Reinforcement learning 

iii) Unsupervised learning 

4.1.1 Supervised Learning: In this form of learning each input pattern is related to a
specific desired output, as shown in figure 4.1. The weights are adjusted at each step to
reduce the difference between the desired and the actual output. This learning can be
performed either on-line or off-line. The back-propagation algorithm uses supervised
learning. The basic disadvantage of this method is that without a teacher it cannot learn
new strategies.

Figure 4.1 : Supervised Learning

4.1.2 Unsupervised Learning : This is also referred to as self-organized learning. This 
requires no teacher or critic to oversee the learning process. There are no specific 
examples of the function to be learned by the network. The aim here is to optimize some 
performance function defined in terms of output activity of the units in the network. Once 
the network is tuned to the statistical regularities of the input data, it develops the ability 
to form internal representations for encoding features of the input and thereby create new 
classes automatically. Competitive learning is a type of unsupervised learning. 

4.1.3 Reinforcement Learning : It involves updating the weights in response to an
“evaluative” teacher signal; it is different from supervised learning, where the teacher
signal is the “correct answer.” It is on-line learning of an input-output mapping through a
process of trial and error designed to maximize a scalar performance index called a
reinforcement signal. The basic idea behind this learning is to learn an evaluation
function, so as to predict the cumulative discounted reinforcement to be received in the
future.


4.2 Various Learning Algorithms


A neuron is an adaptive element. Its weights are modifiable depending upon the
input signal it receives, its output value and the associated teacher response, as shown in
figure 4.2. In some cases the teacher signal is not available and no error information can
be used; the neuron will then modify its weights based on the input and/or the output only.

w_i = weight vector of the i-th neuron ; w_ij = weight between the j-th input and the i-th neuron

Figure 4.2 : Supervised Learning

Under different learning rules, the form of the neuron’s activation function may be
different. The threshold parameter may be included in learning as one of the weights. This
would require fixing one of the inputs, say x_n. We assume here that x_n is fixed and x_n = -1.

A general learning rule adopted in neural networks is the following: the weight vector
$w_i = [w_{i1}\ w_{i2}\ \dots\ w_{in}]^t$ increases in proportion to the product of the input $x$
and the learning signal $r$. $r$ is a function of $w_i$ and $x$, and sometimes of the teacher’s
signal $d_i$.

Therefore,

$r = r(w_i, x, d_i)$    (4.1)

$\Delta w_i(t) = c\, r[w_i(t), x(t), d_i(t)]\, x(t)$    (4.2)

where c is a positive number called the learning constant; it determines the rate of learning.

The weight vector at t+1 will thus be

$w_i(t+1) = w_i(t) + c\, r[w_i(t), x(t), d_i(t)]\, x(t)$    (4.3a)

or, using discrete time training steps,

$w_i^{k+1} = w_i^{k} + c\, r(w_i^{k}, x^{k}, d_i^{k})\, x^{k}$    (4.3b)

The learning so far has been a sequence of discrete-time weight modifications.
Continuous-time learning can be expressed as

$\frac{dw_i(t)}{dt} = c\, r\, x(t)$    (4.4)

Here we assume that the weights have been suitably initialized before each learning
experiment was started. The various learning algorithms implemented in the design of
neural networks can be listed as follows:

i) Error-correction learning 

ii) Delta learning rule 

iii) Hebbian learning 

iv) Competitive learning 

v) Boltzmann learning 

vi) Widrow-Hoff learning 


4.2.1 Error Correction Learning : This rule was originally proposed for single-unit
training. This rule drives the output error to zero. Let
d_k(n) = Desired response of neuron k at time n
y_k(n) = Actual response
X(n) = Input to the Network
The error signal is given by:

$e_k(n) = d_k(n) - y_k(n)$    (4.5)

The ultimate aim of error correction learning is to minimize a cost function based on the
error signal e_k(n) such that the actual response of each output neuron approaches the
target response. The criterion used for selection of the cost function is the mean square
error criterion, defined as the mean-square value of the sum of squared errors:

$J = E\left[\frac{1}{2}\sum_k e_k^2(n)\right]$    (4.6)

where E = Statistical expectation operator
$\sum_k$ = Summation over output neurons

J has to be minimized with respect to the free parameters. Minimization of the cost
function J with respect to the network parameters leads to the so-called method of gradient
descent. Another, easier criterion uses the instantaneous value of the sum of squared
errors:

$\varepsilon(n) = \frac{1}{2}\sum_k e_k^2(n)$    (4.7)

The network is optimized by minimization of $\varepsilon(n)$ with respect to the synaptic
weights. As per the error correcting learning rule (or delta rule), the adjustment
$\Delta w_{kj}(n)$ made to $w_{kj}(n)$ is given by:

$\Delta w_{kj}(n) = \eta\, e_k(n)\, x_j(n)$    (4.8)

where $\eta$ is the learning rate, a positive constant. Thus from the above equation we see
that the adjustment in the synaptic weight is proportional to the product of the error signal
and the input signal of the synapse in question.

Thus,

$e_k(n) = d_k(n) - y_k(n)$

$\Delta w_{kj}(n) = \eta\, e_k(n)\, x_j(n)$

$w_{kj}(n+1) = w_{kj}(n) + \Delta w_{kj}(n)$    (4.9)

Error correction learning is a closed-loop feedback system. Here the choice of $\eta$ is of
great significance: a very small value gives smooth convergence but consumes
more time, whereas a high value of $\eta$ gives quick convergence but involves the
danger of divergence.
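The effect of the learning rate can be illustrated with a small Python sketch of equations (4.5), (4.8) and (4.9) for a single linear neuron. The input values, the desired response and the two learning rates are assumptions chosen for the illustration; the function name train_error_correction is likewise hypothetical.

```python
def train_error_correction(eta, steps=50):
    """Repeat w_kj(n+1) = w_kj(n) + eta * e_k(n) * x_j(n) for one linear neuron.

    Returns the absolute output error after the given number of steps.
    """
    x = [1.0, 2.0]   # input signals x_j (illustrative values)
    d = 1.0          # desired response d_k
    w = [0.0, 0.0]   # initial synaptic weights
    for _ in range(steps):
        y = sum(wj * xj for wj, xj in zip(w, x))          # actual response y_k
        e = d - y                                          # error signal, eq. (4.5)
        w = [wj + eta * e * xj for wj, xj in zip(w, x)]    # eqs. (4.8) and (4.9)
    y = sum(wj * xj for wj, xj in zip(w, x))
    return abs(d - y)


small_eta_error = train_error_correction(eta=0.05)  # slow but steady decrease
large_eta_error = train_error_correction(eta=0.5)   # error grows at every step
print(small_eta_error, large_eta_error)
```

For this particular input the error is multiplied by (1 - 5 * eta) at every step, so the small learning rate shrinks the error steadily while the large one makes it oscillate with growing amplitude.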
4.2.2 Delta Learning Rule (Supervised) : This is valid only for continuous activation
functions, and it is a supervised learning scheme, as shown in figure 4.3. The learning
signal for this rule is called delta and is defined as:

$r \triangleq [d_i - f(w_i^t x)]\, f'(w_i^t x)$    (4.10)

where $f'$ is the derivative of the activation function $f(net)$, and $net = w_i^t x$.

This learning rule is derived from the condition of least squared error between $o_i$ and
$d_i$. We calculate the gradient vector with respect to $w_i$ of the squared error defined as:

$E \triangleq \frac{1}{2}(d_i - o_i)^2$    (4.11a)

It is equivalent to

$E = \frac{1}{2}[d_i - f(w_i^t x)]^2$    (4.11b)

The error gradient vector value is given by

$\nabla E = -(d_i - o_i)\, f'(w_i^t x)\, x$    (4.12a)

The components of the gradient vector are

$\frac{\partial E}{\partial w_{ij}} = -(d_i - o_i)\, f'(w_i^t x)\, x_j$,  for j = 1, 2, ..., n    (4.12b)

Minimization of the error requires the weight changes to be in the negative gradient
direction, thus we take

$\Delta w_i = -\eta \nabla E$    (4.13)

where $\eta$ is a positive constant. From equations 4.12a and 4.13 we have

$\Delta w_i = \eta (d_i - o_i)\, f'(net_i)\, x$    (4.14a)

and for a single weight adjustment

$\Delta w_{ij} = \eta (d_i - o_i)\, f'(net_i)\, x_j$,  for j = 1, 2, ..., n    (4.14b)

Therefore, the weight adjustment in the above equation is computed based on
minimization of the squared error. Using the general learning rule (4.2), we have

$\Delta w_i(t) = c\, r[w_i(t), x(t), d_i(t)]\, x(t)$

and plugging in the learning signal given by (4.10), the weight adjustment becomes

$\Delta w_i = c (d_i - o_i)\, f'(net_i)\, x$    (4.15)

which is identical to (4.14a), since c and $\eta$ are arbitrary constants. Thus,

► The weights are initialized at any values for this method of training

► This rule is also called the Continuous Perceptron Training Rule

► Delta Learning Rule can be generalized for multi-layer networks
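A minimal Python sketch of the single-neuron delta rule (4.14a) is given here for illustration. It assumes the continuous bipolar activation function, a bipolar AND training set with the last input fixed at -1 (so that its weight acts as the threshold), and arbitrary initial weights and learning constant; none of these values come from the experiments in this work.

```python
import math


def f(net):
    """Continuous bipolar activation function."""
    return 2.0 / (1.0 + math.exp(-net)) - 1.0


def f_prime(net):
    """Derivative of the bipolar activation function."""
    e = math.exp(-net)
    return 2.0 * e / (1.0 + e) ** 2


def delta_step(w, x, d, eta):
    """One delta-rule update: w <- w + eta * (d - o) * f'(net) * x."""
    net = sum(wj * xj for wj, xj in zip(w, x))
    scale = eta * (d - f(net)) * f_prime(net)
    return [wj + scale * xj for wj, xj in zip(w, x)]


def total_error(w, patterns):
    """Sum over patterns of the squared error E = (d - o)^2 / 2."""
    return sum(0.5 * (d - f(sum(wj * xj for wj, xj in zip(w, x)))) ** 2
               for x, d in patterns)


# Bipolar AND (assumed example); third input is the fixed threshold input -1.
patterns = [([-1, -1, -1], -1), ([-1, 1, -1], -1),
            ([1, -1, -1], -1), ([1, 1, -1], 1)]
w = [0.1, -0.2, 0.05]                 # arbitrary initial weights
error_before = total_error(w, patterns)
for _ in range(200):                  # 200 passes over the training set
    for x, d in patterns:
        w = delta_step(w, x, d, eta=0.5)
error_after = total_error(w, patterns)
print(error_before, error_after)
```

Because the outputs of a continuous bipolar neuron never reach exactly +1 or -1, the error decreases towards zero without attaining it.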
The delta learning rule requires the calculation of $f'(net)$ at every step. Therefore, we
use these equations for the continuous bipolar activation function:

$f(net) = \frac{2}{1 + \exp(-net)} - 1$

$f'(net) = \frac{2\exp(-net)}{[1 + \exp(-net)]^2}$

With $o = f(net)$, we have

$\frac{1}{2}(1 - o^2) = \frac{1}{2}\left[1 - \left(\frac{1 - \exp(-net)}{1 + \exp(-net)}\right)^2\right] = \frac{2\exp(-net)}{[1 + \exp(-net)]^2} = f'(net)$

Therefore,

$f'(net) = \frac{1}{2}(1 - o^2)$    (4.16)

Equation 4.16 is valid for bipolar continuous activation functions.
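Equation (4.16) can be checked numerically. The short Python sketch below compares the derivative computed directly with the value obtained from the neuron output alone; the test points are arbitrary.

```python
import math


def f(net):
    """Continuous bipolar activation function."""
    return 2.0 / (1.0 + math.exp(-net)) - 1.0


def f_prime_direct(net):
    """Derivative computed from its closed form in exp(-net)."""
    e = math.exp(-net)
    return 2.0 * e / (1.0 + e) ** 2


def f_prime_from_output(net):
    """Derivative computed from the output o = f(net) using eq. (4.16)."""
    o = f(net)
    return 0.5 * (1.0 - o * o)


# Largest disagreement between the two forms over a few arbitrary points.
max_gap = max(abs(f_prime_direct(net) - f_prime_from_output(net))
              for net in (-3.0, -0.5, 0.0, 1.2, 4.0))
print(max_gap)
```

The practical benefit is that the derivative needs no extra exponential: once the output o is known, one multiplication and one subtraction suffice.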


4.2.3 Hebbian Learning : “If two neurons on either side of a synapse are activated
simultaneously then the strength of that synapse is selectively increased. If the two
neurons on either side of a synapse are activated asynchronously, then that synapse is
selectively weakened or eliminated”. Such a synapse is called a Hebbian Synapse. Refer to
figure 4.4.

According to Hebbian learning, the adjustment applied to $w_{kj}$ at time n is
expressed in the form:

$\Delta w_{kj}(n) = F(y_k(n), x_j(n))$    (4.17a)

F(. , .) is a function of both post-synaptic and pre-synaptic activities. As a special case we
can write:

$\Delta w_{kj}(n) = \eta\, y_k(n)\, x_j(n)$    (4.17b)

This rule shows the correlational nature of the Hebbian synapse $w_{kj}$, expressed as the
product of incoming and outgoing signals. There are modifications to the synaptic
adjustment to ensure that $w_{kj}$ does not saturate. The modifications are as follows:

$\Delta w_{kj}(n) = \eta\, y_k(n)\, x_j(n) - \alpha\, y_k(n)\, w_{kj}(n)$    (4.18)

$\Delta w_{kj}(n) = \alpha\, y_k(n)\, [c\, x_j(n) - w_{kj}(n)]$    (4.19)

where $c = \eta / \alpha$.

This equation implies that for inputs for which $x_j(n) < w_{kj}(n)/c$, the weight
$w_{kj}(n+1)$ at time n+1 is reduced by an amount proportional to the postsynaptic
activity $y_k(n)$, and vice versa, as shown in the graph.

Figure 4.4 : Hebbian Learning Graph

The activity balance point for modifying the synaptic weight at time n+1 is a variable,
$w_{kj}/c$; it is proportional to the value of $w_{kj}$ at the time of pre-synaptic activity. This
approach eliminates the problem of runaway synaptic weight instability and results in a
negatively accelerating modification curve. Another way of formulating the Hebbian
postulate is to make the change in the synaptic weight proportional to the covariance
between pre-synaptic and post-synaptic activities. According to this rule, on average, the
strength of the synapse increases if the post-synaptic and the pre-synaptic activities are
positively correlated and decreases if they are negatively correlated.
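The difference between the plain Hebbian rule (4.17b) and the modified rule (4.18) can be seen in a small Python sketch. As a simplifying assumption, the pre-synaptic and post-synaptic activities are held constant at 1, and the values of eta and alpha are arbitrary.

```python
def hebb_plain(w, x, y, eta):
    """Plain Hebbian update, eq. (4.17b)."""
    return w + eta * y * x


def hebb_with_forgetting(w, x, y, eta, alpha):
    """Hebbian update with the forgetting term of eq. (4.18)."""
    return w + eta * y * x - alpha * y * w


eta, alpha = 0.1, 0.2      # illustrative constants, so c = eta / alpha = 0.5
x, y = 1.0, 1.0            # activities held constant (simplifying assumption)
w_plain = 0.0
w_forget = 0.0
for _ in range(200):
    w_plain = hebb_plain(w_plain, x, y, eta)
    w_forget = hebb_with_forgetting(w_forget, x, y, eta, alpha)

# The plain rule keeps growing; the modified rule settles near c * x.
print(w_plain, w_forget)
```

Under these assumptions the plain rule grows linearly without bound, while the modified rule approaches the balance point w = c x, in agreement with equation (4.19).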
4.2.4 Competitive Learning: The output neurons of a network compete among
themselves for being the one to be activated or fired. Only one output neuron can be
active at a time, as shown in figure 4.5. The three basic elements of a competitive
learning rule are:

• A set of neurons that are all the same except for randomly distributed weights, and
therefore respond in a different manner to a given set of input patterns.

• A “limit” imposed on the “strength” of each neuron.

• A mechanism that permits the neurons to compete for the right to respond to a given
input, such that only one output neuron is active at a time. This neuron is called the
“winner-takes-all” neuron.


Thus the individual neurons of the network learn to specialize on sets of similar patterns,
and thereby become feature detectors.

Figure 4.5 : Competitive Learning Network (layer of source nodes, single layer of output
neurons)

If the internal activity level $v_j$ of neuron j for a given input, say x, is the largest, then it
is the winning neuron. Its output is made 1 and the outputs of the rest are 0:

$y_j = 1$ if $v_j > v_i$ for all $i \ne j$, and $y_j = 0$ otherwise    (4.20)

$\sum_i w_{ji} = 1$  for all j (output neurons)

$\Delta w_{ji} = \eta (x_i - w_{ji})$ if neuron j wins the competition
$\Delta w_{ji} = 0$ if neuron j loses the competition    (4.21)
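One step of winner-takes-all learning, following equation (4.21), may be sketched in Python as below. The internal activity is assumed to be the weighted sum of the inputs, and the weight and input values are illustrative only.

```python
def winner(weights, x):
    """Index of the output neuron with the largest internal activity."""
    activities = [sum(wji * xi for wji, xi in zip(wj, x)) for wj in weights]
    return max(range(len(weights)), key=lambda j: activities[j])


def competitive_step(weights, x, eta):
    """Move only the winning neuron's weights towards the input, eq. (4.21)."""
    j = winner(weights, x)
    weights[j] = [wji + eta * (xi - wji) for wji, xi in zip(weights[j], x)]
    return j


# Two output neurons, two inputs; each weight vector sums to 1.
weights = [[0.5, 0.5], [0.9, 0.1]]
x = [0.2, 0.8]                       # input pattern, also summing to 1
won = competitive_step(weights, x, eta=0.5)
print(won, weights)
```

When the input components sum to 1, the update keeps the weights of the winning neuron summing to 1, so the constraint on the total synaptic strength is preserved.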


4.2.5 Boltzmann Learning : It is a stochastic learning algorithm. The neurons constitute
a recurrent structure and they operate in a binary manner, i.e. they are either “on” (+1) or
“off” (-1). The machine is characterized by an energy function E:

$E = -\frac{1}{2}\sum_i \sum_j w_{ji}\, s_j\, s_i$,  $i \ne j$    (4.22)

where $s_i$ = state of neuron i
$w_{ji}$ = weight of the link between neurons i and j
$i \ne j$ : no self-feedback
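The energy function can be evaluated directly for a small network. The Python sketch below assumes the usual form of (4.22) with the factor -1/2 and a symmetric weight matrix; the weights and states are arbitrary illustrative values.

```python
def energy(w, s):
    """Energy E = -1/2 * sum over i != j of w[j][i] * s[j] * s[i]."""
    n = len(s)
    total = sum(w[j][i] * s[j] * s[i]
                for i in range(n) for j in range(n) if i != j)
    return -0.5 * total


# Symmetric weights with zero diagonal (no self-feedback); illustrative values.
w = [[0, 1, -2],
     [1, 0, 3],
     [-2, 3, 0]]

print(energy(w, [1, -1, 1]))   # neuron 1 is "off"
print(energy(w, [1, 1, 1]))    # neuron 1 flipped to "on"
```

Flipping the state of a single neuron changes the energy; in Boltzmann learning such flips are accepted with a probability that depends on this energy change.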


4.2.6 The Widrow-Hoff learning rule

The Widrow-Hoff learning rule (Widrow, 1962) is applicable for supervised
learning of neural networks. It is independent of the activation function of the neurons
since it minimises the squared error between the desired output value $d_j$ and the neuron’s
activation value $net_j = w_j^t x$. The learning signal for this rule is defined as follows:

$r \triangleq d_j - net_j$    (4.23)

The weight vector increment under this learning rule is

$\Delta w_j = c (d_j - net_j)\, x$    (4.24)

or, for a single weight, the adjustment is

$\Delta w_{ji} = c (d_j - net_j)\, x_i$,  for i = 1, 2, ..., n    (4.25)

This rule can be considered a special case of the delta learning rule with a linear
activation function, i.e. in equation (4.10), $f(net) = net$ and $f'(net) = 1$, and the
equation becomes identical to equation (4.23). This rule is sometimes called the LMS
(least mean square) learning rule. Weights are initialised at any values in this method.
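A minimal Python sketch of the Widrow-Hoff update (4.24) is given below. The target relation d = 2 x1 + 1, the sample points, the learning constant and the number of passes are assumptions made for the illustration; the second input is fixed at -1 so that its weight plays the role of the threshold.

```python
def widrow_hoff_step(w, x, d, c):
    """One LMS update: w <- w + c * (d - net) * x, with net = w . x."""
    net = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + c * (d - net) * xi for wi, xi in zip(w, x)]


# Samples of the assumed target d = 2 * x1 + 1; second input fixed at -1.
samples = [([x1, -1.0], 2.0 * x1 + 1.0) for x1 in (0.0, 1.0, 2.0, 3.0)]

w = [0.0, 0.0]                       # weights may start at any values
for _ in range(500):                 # 500 passes over the samples
    for x, d in samples:
        w = widrow_hoff_step(w, x, d, c=0.05)

# Expected to approach w1 = 2 and w2 = -1, since net = w1 * x1 - w2.
print(w)
```

No activation function appears anywhere in the update, which is the sense in which the rule is independent of the activation function.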
Chapter 5


Benchmark Problems


Various benchmark problems have been developed to analyse the
performance of multi-layered feed-forward neural networks. These problems differ
from commonly encountered problems in that they cover the various aspects that
must be tested before branding any algorithm as successful. Researchers use these
typical problems to test various performance characteristics of multi-layer feed-forward
networks and to verify theoretical results through simulation. In this thesis, an
attempt has been made to solve these benchmark problems as well as a few more typical
model problems in the classification and function approximation areas using multi-layered
feed-forward networks. These problems have been solved with network models
developed using the MATLAB neural network toolbox. Further, these solutions have been
compared with those obtained using models developed in the JAVA language. The codes
written in JAVA need continuous refinement to match the performance of MATLAB.

5.1 Exclusive-OR (XOR) Problem

A single layer perceptron has no hidden layer. It cannot classify input patterns that are
not linearly separable. XOR is an example of a linearly non-separable problem, as shown in
figure 5.1. It can be viewed as a special case of a more general problem, namely, that of
classifying points in a unit hypercube (XOR is the two-dimensional case).

Each point in the hypercube belongs to class 0 or class 1.

Input Pattern    Output Pattern
0 0              0
0 1              1
1 0              1
1 1              0

0 XOR 1 = 1 and 1 XOR 0 = 1, i.e. (1,0) and (0,1) belong to class 1.
The use of a single neuron with two inputs results in a straight line for the decision
boundary in the input space. For all the points on one side of this line, its output is 1 and
for the points on the other side, the neuron output is 0. The position and the orientation of
the line depend on the weights and the threshold. In the case of XOR the decision boundary
cannot be a straight line, so an elementary perceptron cannot solve the XOR problem.

XOR can be solved by using a hidden layer with two neurons, as shown in figure 5.2.
- Each neuron is represented by the McCulloch-Pitts model.

- Bits 0 and 1 are represented by 0 and +1.



The top neuron (neuron 1) is characterized as follows:

$w_{11} = w_{12} = +1$ ; $\theta_1 = +3/2$

The bottom neuron (neuron 2) is characterized as follows:

$w_{21} = w_{22} = +1$ ; $\theta_2 = +1/2$

The output neuron (neuron 3) is characterized as follows:

$w_{31} = -2$
$w_{32} = +1$
$\theta_3 = +1/2$

Neuron 2 has an excitatory (+ve) connection to the output neuron whereas Neuron 1 has
an inhibitory (-ve) connection to the output neuron.

• When both hidden neurons are OFF (i.e. Input = (0,0)) then the output neuron
remains OFF.

• When both hidden neurons are ON (i.e. Input = (1,1)) then the output neuron is
switched OFF again, as the inhibitory effect of the large negative weight
connected to the top hidden neuron overpowers the excitatory effect of the
positive weight connected to the bottom hidden neuron.

• When the top neuron is OFF and the bottom neuron is ON (Input = (0,1) or (1,0)), the
output neuron is switched ON due to the excitatory effect of the positive weight
connected to the bottom hidden neuron. Thus XOR is solved.
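The weights and thresholds quoted above can be verified with a short Python sketch. It assumes that a McCulloch-Pitts neuron fires (output 1) when its weighted input sum is greater than or equal to its threshold; the function names are chosen for the illustration.

```python
def step(net, theta):
    """Threshold unit: output 1 when the weighted sum reaches the threshold."""
    return 1 if net >= theta else 0


def xor_net(x1, x2):
    """Two hidden neurons and one output neuron with the weights given above."""
    h1 = step(1 * x1 + 1 * x2, 1.5)      # neuron 1: ON only for input (1,1)
    h2 = step(1 * x1 + 1 * x2, 0.5)      # neuron 2: ON if any input is 1
    return step(-2 * h1 + 1 * h2, 0.5)   # neuron 3: inhibited by neuron 1


# Evaluate the network on all four input patterns.
truth_table = {(a, b): xor_net(a, b) for a in (0, 1) for b in (0, 1)}
print(truth_table)
```

The hidden layer maps (0,1) and (1,0) to the same hidden state, which makes the problem linearly separable for the output neuron.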

5.2 The N-Parity Problem

Problem: Mapping of an N-bit wide binary number into its parity. If the input pattern
contains an even number of 1s then its parity is 1, else it is 0, as shown in table 5.1. This is
a difficult problem because patterns that are closer (using Euclidean distance) in the
measure (sample) space, i.e. numbers that differ in only one bit, require their answers to
be different.
5.3 The Two Spiral Problem (Proposed by Alexis Wieland of MITRE Corp.)

It is a benchmarking problem, since it is extremely hard to solve using the back-propagation
algorithm. It is a problem of classification. The problem consists of Np = 194 (X,Y) pairs
of points that lie on two interlocking spirals (Np/2 on each) that go around a common origin
three times. Each spiral belongs to a different class and the network has to associate
points in the sample space with the correct class (+1, -1). Refer to figure 5.5.

Figure 5.5 : Two Spiral (both axes range from -15 to 15)
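The training points can be generated programmatically. The Python sketch below uses the radius and angle formula commonly distributed with this benchmark, which gives 97 points per spiral and three full turns; this formula is an assumption here, and the scale of the data actually used in this work may differ (the figure axes extend to 15).

```python
import math


def two_spirals(points_per_spiral=97):
    """Return a list of (x, y, label) with labels +1 and -1, one per spiral."""
    data = []
    for i in range(points_per_spiral):
        angle = i * math.pi / 16.0             # 96 * pi / 16 = three full turns
        radius = 6.5 * (104 - i) / 104.0       # radius shrinks towards the origin
        x = radius * math.sin(angle)
        y = radius * math.cos(angle)
        data.append((x, y, +1))                # point on the first spiral
        data.append((-x, -y, -1))              # mirrored point on the second
    return data


spiral_data = two_spirals()
print(len(spiral_data), spiral_data[0], spiral_data[1])
```

Each point of the second spiral is the reflection through the origin of a point of the first, which is why the two classes interlock.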

5.4 The Encoding/Decoding Problem : Ackley, Hinton and Sejnowski
posed a problem for internal representation testing in which a set of Ni orthogonal input
patterns is mapped to a set of No orthogonal output patterns through a small set of hidden
neurons, Nh. This benchmark forces the hidden layer nodes to be efficient. The
expectation is that the hidden units will form a binary encoding of the Ni-bit input pattern
on to an Nh-bit pattern and then form a binary decoding of the Nh-bit pattern on to a No-bit
output pattern, with

$N_h = \log_2 N$
and
$N_i = N_o = N$

Figure 5.6 : Encoding-Decoding Network (Ni input neurons, No output neurons)

This benchmark problem is important for evaluating the network storage capacity and it has
applications in data compression and transmission.


5.5 Logic/Arithmetic Problem: The logic/arithmetic operation problems have
been studied since the conception of ANN. Minsky and Papert (1969) used the XOR
problem to show the limitation of single layer perceptrons. The basic symbols for
combinatorial logic gates are shown below. By combining these gates, any logic
statement can be implemented.

Figure 5.7 : Logic Gates (XOR, NOR, NAND, AND, OR, NOT)

They can also be combined to generate temporal components useful for sequencing.
ANNs can also be formed that imitate the same logic elements.

Figure 5.8 : ANN minimum size adder of 2-bit numbers.

As shown in figure 5.8, two more logic/arithmetic problems of importance are
binary addition and binary negation. The binary addition problem has the feature of
possessing local minima. Hence it can be used to study the manner in which networks
and learning algorithms find and avoid local minima. The
binary addition of two 2-bit binary numbers gives at most a 3-bit number.

The NAND and the NOR functions can be described by the network shown below with
McCulloch-Pitts models, in which the neuron output o is given by

$o = 1$ if $\sum_{i=1}^{n} w_i x_i \ge T$

$o = 0$ if $\sum_{i=1}^{n} w_i x_i < T$

Figure 5.11 : NAND Gate (excitatory and inhibitory inputs)

The binary negation problem consists of an (n + 1)-bit number $(b_1, b_2, \dots, b_n, b_{n+1})_i$
as input (n-bit data and an extra negation bit $b_N = b_{n+1}$). The output is an n-bit number
$(b_1, b_2, \dots, b_n)_o$ that is equal to the input pattern if $b_N$ is 0, and is the
complement of the input if $b_N = 1$, i.e.

$(b_1, b_2, \dots, b_n)_o = (b_1, b_2, \dots, b_n)_i$  if $b_N = 0$

$(b_1, b_2, \dots, b_n)_o = \overline{(b_1, b_2, \dots, b_n)_i}$  if $b_N = 1$

The problem reduces to a set of n XOR problems between the negation bit and each of
the inputs $(b_k)_i$, such that when $b_N = 0$

$(b_k)_o = (b_k)_i$, i.e. $(b_k)_o = 0$ for $(b_k)_i = 0$ and $(b_k)_o = 1$ for $(b_k)_i = 1$

and when $b_N = 1$

$(b_k)_o = \overline{(b_k)_i}$, i.e. $(b_k)_o = 1$ for $(b_k)_i = 0$ and $(b_k)_o = 0$ for $(b_k)_i = 1$

5.6 The Accuracy/Classification Problem : Another measure of performance
is the network accuracy. A way of testing this is to define a classification problem where
there is a boundary layer between classes in which the error is acceptable, i.e. not
penalized or less penalized.

A general geometry can be specified for the classification. The basic data is
shown in the figure below, where the decision boundary is fuzzy, i.e. a ‘Boundary layer’
(BL) is defined where no training data is present and perfect recall is expected. The BL
thickness, Δc (Euclidean distance between classes), is the relative accuracy of separation
for a given network. The smaller the Δc for a network, the more accurate it can be.

This problem can be used to determine the effectiveness of the learning 
algorithms to converge to the desired accuracy. Conventional forms of back propagation 


yields a learning time that diverges like 1/s^, where 




Typically, the patterns belonging to a class are not all same, i.e. the patterns used for 
testing and those for training may not be the same. Pattern classification problems are 
said to belong to the category of supervised learning. 

5.7 Majority Vote : Majority vote networks are trained to output a one when more
than half of the binary inputs are ‘on’ (one). Otherwise, the network outputs a zero. These
networks typically have an odd number n of inputs and a single output. The number of
hidden units will vary, but typically will be less than the number of inputs.
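The complete training set for a majority vote network can be enumerated as sketched below in Python. The function name majority_vote_set and the choice of n = 3 are assumptions made for the illustration.

```python
from itertools import product


def majority_vote_set(n):
    """All n-bit input patterns paired with 1 if more than half the bits are on."""
    return [(bits, 1 if sum(bits) > n / 2 else 0)
            for bits in product((0, 1), repeat=n)]


training_set = majority_vote_set(3)
for bits, target in training_set:
    print(bits, target)
```

With an odd number of inputs there are no ties, so exactly half of the 2^n patterns belong to each class.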
5.8 The Sin(x)Sin(y) Problem : General function mapping problems have been
used by different researchers to test their network capabilities, learning algorithms etc. A
popular function approximation or function mapping problem is:

Z(x,y) = Sin(x) Sin(y)

This function gets more complicated as the norm of the input vector (x,y) grows. It falls
in the category of two-input-one-output problems. Refer to figure 5.13.

Figure 5.13 : Sin(x) Sin(y) Plot


5.9 Function Approximation : A multi-layer perceptron trained with the back-propagation
algorithm can be viewed as a practical vehicle for performing non-linear
input-output mapping of a general nature. Suppose p denotes the number of input nodes
of a multilayer perceptron, and q denotes the number of neurons in the output layer of
the network. The input-output relationship of the network defines a mapping from a
p-dimensional Euclidean input space to a q-dimensional Euclidean output space, which is
infinitely continuously differentiable. The fundamental result states that a two-layered
feed-forward network with a sufficient number of hidden units of the sigmoidal
activation type and a single linear output unit is capable of approximating any
continuous function $f: R^n \to R$ to any desired accuracy. The above statement has been
proved and stated in the form of various theorems (Kolmogorov’s theorem, Cybenko’s
theorem, Hornik et al.). These authors independently proved the theorem that a
one-hidden-layer feed-forward neural network is capable of approximating uniformly any
continuous multivariate function to any desired degree of accuracy. Hornik et al. proved
another important result relating to the approximating capability of multi-layer
feed-forward neural networks employing sigmoidal activation functions. They showed that
these networks can approximate not only an unknown function but also its derivative.
These results have further been extended to any continuous function f on $R^n$, showing
that it can be accurately approximated by adjusting the weights and thresholds only.

5.10 Character Recognition : Multi-layer feed-forward networks can be trained to
solve a variety of data classification and recognition problems. The network can be used
for handwritten as well as printed letter, integer or symbol recognition. The
input character is first normalized to extend it to the full height and width of the bar mask.
The characters are required to be encoded and presented to the network for training. The
method of encoding the characters is of great significance; the endeavor must be to
encode in a manner that creates maximum separation between the characters. This
ensures that a properly trained network can recognize even noisy data easily. In the
present work recognition of integers from 0 to 9 and recognition of English letters have
been considered. The encoding has been done using an 8x8 matrix for each character. For
better performance of the net, it should be trained on noisy data also. During training, the
network is trained to associate outputs with input patterns. When the network is used, it
identifies the input pattern and tries to output the associated output pattern. The power of
a neural network comes to life when a pattern that has no output associated with it is given
as an input. In this case the network gives the output that corresponds to the taught input
pattern that is least different from the given pattern.

5.11 Curve Fitting : A neural network is capable of performing curve fitting. It can
be trained on some portion of the curve (function) and can then be used to predict future
points on the curve. The time series method has been used in this work for curve fitting
problems. Typically the previous three or four values are used for predicting the next value
on the curve; the network is trained in this manner and is then used to predict the future
values. It has been found that in the case of curves like the sine (or cosine), the network
needs to be trained on one complete cycle (2*pi) and can then be used to predict further
values. In the case of the sineexp function, the network needs to capture only a part of the
curve, and further predictions can be done by the network.
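The time series arrangement of the training data can be sketched in Python as follows. The window length of three, the sampling of sixteen points per cycle and the function name make_windows are assumptions made for the illustration, not the settings used in the experiments.

```python
import math


def make_windows(series, window=3):
    """Pair each run of `window` consecutive values with the value that follows."""
    return [(series[i:i + window], series[i + window])
            for i in range(len(series) - window)]


# A little more than one complete cycle of the sine curve, 16 samples per cycle.
step = 2 * math.pi / 16
series = [math.sin(k * step) for k in range(20)]

pairs = make_windows(series, window=3)
inputs, target = pairs[0]
print(len(pairs), inputs, target)
```

Each pair supplies one training pattern: the three previous values form the network input and the next value on the curve is the desired output.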
Chapter 6


Results and Discussions

In the present thesis work, some of the problems discussed in the previous
chapter (Chapter 5) have been solved. These are basically benchmark problems and require
rigorous effort to find an efficient solution. Most of these problems do not have a direct
solution and require searching in a very methodical manner for the ideal solution. Most of
these have an optimum solution and the effort has been to find this or something very close
to it, since it would be very difficult to locate the global optimum solution. The
back-propagation algorithm has been employed, along with its variants. Both first order and
second order algorithms have been used for this purpose. The solutions found using
MATLAB have been tested on the codes written in JAVA. The codes written in JAVA are
relatively new and have some constraints; however, most of the solutions found using
MATLAB produce satisfactory results. The following sections discuss the
solutions to the various problems which have been considered in this thesis work.

6.1 The Exclusive-OR Problem

Theoretically, this problem is linearly non-separable and cannot be solved by the
use of a single neuron in the output layer. It can be solved by a feed-forward neural network
having a single hidden layer with at least two neurons. Traingd (MATLAB) was the training
algorithm used and the activation function used for both the hidden layer and the output
layer was logsig. Logsig is the same as the unipolar sigmoidal activation
function. Besides the theoretical structure, various other structures have also been found that
give good results in fewer iterations. The results and simulations are shown in figure 6.1
for MATLAB and figures 6.2 and 6.3 for JAVA architectures respectively. The various
structures used are listed below in table 6.1.
MATLAB RESULTS

Table No 6.1 : Exclusive-OR

Exclusive-OR: Artificial Neural Network Architectures used; Error Goal = 0.0001

Network   Hidden Layers   Neurons   Activation Fn     Algorithm   Iterations
XOR1      1               3         Logsig, Tansig    Trainrp     310
XOR2      1               3         Logsig, Logsig    Traingdm
XOR3      1               3         Tansig, Tansig    Traingd     866
XOR4      1               2         Logsig, Logsig    Traingd     33929
XOR5      1               3         Tansig, Logsig    Traingd     9597
XOR8      1               3         Tansig, Logsig    Trainrp     48

Trainrp = Resilient Propagation Algorithm
Traingd = Standard Back-propagation Algorithm
Traingdm = Standard Back-propagation with Momentum

JAVA RESULTS

Table No 6.2 : Exclusive-OR

Exclusive-OR: Artificial Neural Network Architectures used; Error Goal = 0.0001

Network   Hidden Layers   Neurons   Activation Fn     Algorithm   Iterations
XOR1      1               3         Logsig, Tansig    Trainrp     176
XOR3      1               3         Tansig, Tansig    Traingd     744
XOR4      1               2         Logsig, Logsig    Traingd     2000
XOR8      1               3         Tansig, Logsig    Trainrp     77

Conclusion : From the above results it is concluded that:

(a) With only two neurons in the hidden layer, a large number of iterations is required to
achieve proper classification.

(b) Resilient propagation has given the fastest training, in both MATLAB and JAVA.

(c) JAVA codes were faster in training.

(d) The theoretical structure gave a faster result in the case of JAVA.
MATLAB PLOTS

XOR1: [3 1] {Logsig Tansig} Trainrp; Epochs = 310
XOR3: [3 1] {Tansig Tansig} Traingd; Epochs = 866
XOR4: [2 1] {Logsig Tansig} Trainrp; Epochs = 33929 (Performance is 9.99991e-005, Goal is 0.0001)

Figure 6.1 : Training Plots (XOR)

JAVA PLOTS

XOR1: [3 1] {Logsig Tansig} Trainrp; Epochs = 176
XOR3: [3 1] {Tansig Tansig} Traingd

Figure 6.2 : Exclusive-OR Error Plot

JAVA PLOTS

XOR8: [3 1] {Tansig Logsig} Trainrp; Epochs = 77
XOR4: [3 1] {Logsig Logsig} Trainrp; Epochs = 2000

Figure 6.3 : Exclusive-OR Error Plot (error against iterations)

6.2 4-Parity Problem


In this work the 4-bit parity problem has been considered. If the input pattern contains an
even number of 1s then its parity is 1, else it is 0. This is a difficult problem because
patterns that are closer (using Euclidean distance) in the measure (sample) space, i.e.
numbers that differ in only one bit, require their answers to be different. The results and
simulations are shown in figures 6.4 and 6.5 for MATLAB and figures 6.6 to 6.9 for JAVA.
For N = 4 the training set is:

Table 6.3 : 4-Bit Parity Training Set

b3       0000000011111111
b2       0000111100001111
b1       0011001100110011
b0       0101010101010101

Parity   1001011001101001
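The training set in the table can be generated for any N with a few lines of Python. The sketch below follows the convention used in this work (parity 1 for an even number of 1s); the function names are chosen for the illustration.

```python
def parity_target(bits):
    """Return 1 if the pattern holds an even number of 1s, else 0."""
    return 1 if sum(bits) % 2 == 0 else 0


def parity_training_set(n):
    """All 2**n patterns, most significant bit first, with their parity."""
    rows = []
    for value in range(2 ** n):
        bits = [(value >> k) & 1 for k in range(n - 1, -1, -1)]
        rows.append((bits, parity_target(bits)))
    return rows


rows = parity_training_set(4)
parity_row = "".join(str(target) for _, target in rows)
print(parity_row)
```

Any two patterns that differ in a single bit receive opposite targets, which is what makes the problem hard for a network that relies on the closeness of input patterns.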


The general network architecture used for N-Parity problem is shown in figure 5.4. For the 
present problem various architectures were found that gave good results. These 
architectures are listed below: 

MATLAB RESULTS
Table 6.4: 4-Bit Parity Problem

4-Bit Parity Problem: ANN Architectures used; Error Goal = 0.01

Network   Architecture  Activation Fn.           Algorithm  Iterations
Parity    [4 4 1]       {Tansig Tansig Tansig}   Traingd    1279
Parity1   [4 4 1]       {Tansig Tansig Tansig}   Trainrp    214
Parity2   [4 4 1]       {Tansig Tansig Logsig}   Traingd    No Convergence
Parity3   [4 4 1]       {Tansig Logsig Logsig}   Traingd    No Convergence
Parity4   [4 4 1]       {Logsig Logsig Tansig}   Traingd    No Convergence
Parity5   [4 4 1]       {Logsig Logsig Logsig}   Trainrp    No Convergence
Parity6   [4 4 1]       {Tansig Logsig Logsig}   Trainlm    31
Parity7   [4 4 1]       {Tansig Tansig Logsig}   Trainscg   168
Parity8   [4 4 1]       {Tansig Logsig Logsig}   Trainscg   No Convergence
Parity9   [4 4 1]       {Logsig Logsig Logsig}   Trainscg   58
Parity10                                                    758
Parity11  [4 4 1]       {Tansig Tansig Logsig}   Trainrp    No Convergence


JAVA RESULTS

Table 6.5: 4-Bit Parity Problem

4-Bit Parity Problem: ANN Architectures used; Error Goal = 0.01

Network   Architecture  Activation Fn.           Algorithm  Iterations
Parity    [4 4 1]       {Tansig Tansig Tansig}   Traingd    No Convergence
Parity1   [4 4 1]       {Tansig Tansig Tansig}   Trainrp    261/313
Parity2   [4 4 1]       {Tansig Tansig Logsig}   Traingd    24313/5349
Parity3   [4 4 1]       {Tansig Logsig Logsig}   Traingd    No Convergence
Parity4   [4 4 1]       {Logsig Logsig Tansig}   Traingd    No Convergence
Parity5   [4 4 1]       {Logsig Logsig Logsig}   Trainrp    10000 (Error Reached = .016)
Parity6   [4 4 1]       {Tansig Logsig Logsig}   Trainlm    17395
Parity7   [4 4 1]       {Tansig Tansig Logsig}   Trainscg   302
Parity8   [4 4 1]       {Tansig Logsig Logsig}   Trainscg   322
Parity9   [4 4 1]       {Logsig Logsig Logsig}   Trainscg   1100
Parity10  [4 4 1]       {Tansig Tansig Tansig}   Trainscg   1572
Parity11  [4 4 1]       {Tansig Tansig Logsig}   Trainrp    1080


Conclusion: From the above results it is concluded that:

(a) Trainlm (Levenberg-Marquardt Back-propagation) was the fastest algorithm in
MATLAB (31 iterations), followed by Scaled Conjugate Gradient Back-propagation.
In JAVA, however, it required a very large number of iterations (17395).

(b) Using the unipolar sigmoid activation function (Logsig) gave faster results
than the bipolar sigmoid activation function (Tansig).

(c) Resilient back-propagation (Trainrp) works well with most of the architectures in
JAVA. It does not, however, work for most of the architectures in MATLAB.

(d) Standard Back-propagation does not work well with most of the architectures,
in both MATLAB and JAVA.

(e) The Scaled Conjugate Gradient algorithm works well for almost all network
architectures.
Performance = 0.009, Goal = 0.01

Figure 6.4 : Parity Plots and Simulations

Figure 6.6 : Parity Plot

Figure 6.7 : Parity Error Plots

Figure 6.9 : Parity Error Plots and Simulations
















6.3 CODEC Problem 


The codec problem is a typical identity-mapping problem. Here a 16-bit input code
is mapped to itself, i.e. to a 16-bit output code. The network is trained to produce
an output binary pattern that is identical to the input pattern. This forces the hidden units
to encode the patterns. For example, with n = 16 inputs and m = 4 hidden units, the network
must provide an equivalent binary (log2 n bit) encoding in the hidden layer. When m is small
compared to n, the task becomes more difficult to learn than when m is near n. This is
sometimes referred to as tight encoding. This benchmark problem is important for evaluating
the network storage capacity, and it has applications in data compression and transmission.
A total of 16 codes have been used for training. Theoretically this problem can be solved by
using a single hidden layer with log2 N neurons, where N is
the number of inputs. Thus 4 neurons have been used in the hidden layer. Two architectures
have been found that give good results under these constraints on the hidden layer and the
number of neurons. These have been tested in both MATLAB and JAVA. The error
convergence plots and the results are shown in figure 6.10 for MATLAB and figures 6.11 -
6.12 for JAVA. The two architectures that were found to give proper convergence are:


MATLAB RESULTS
Table 6.6: CODEC Problem

CODEC Problem: ANN Architectures used; Error Goal = 0.0001

Network  Architecture  Activation Fn.    Algorithm  Iterations
Codec    [4 16]        {Logsig Logsig}   Trainrp    74
         [4 16]        {Tansig Logsig}   Trainrp    85
         [4 16]        {Tansig Logsig}   Trainscg   610


JAVA RESULTS
Table 6.7: CODEC Problem

CODEC Problem: ANN Architectures used; Error Goal = 0.0001

Network  Architecture  Activation Fn.    Algorithm  Iterations
Codec    [4 16]        {Logsig Logsig}   Trainrp    95/96
Codec1   [4 16]        {Tansig Logsig}   Trainrp    86/203
Codecs   [4 16]        {Tansig Logsig}   Trainrp    >10000
































Conclusion: From the results obtained it can be concluded that:

(a) Using the unipolar sigmoid activation function with Resilient Back-propagation
gives fast convergence and good simulation in the case of MATLAB.

(b) The same architectures took comparatively longer to reach error convergence
in JAVA.

(c) An ANN built using JAVA requires less rigorous training as far as the error goal is
concerned.

(d) TrainSCG worked for MATLAB, but it did not show good convergence for JAVA.
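
The codec training set and the hidden-layer size used in this section can be sketched as follows. The example is in Python, not the MATLAB or JAVA code of this work, and it assumes that the 16 codes are one-hot 16-bit patterns; the section does not list the codes that were actually used, so both the patterns and the function names are illustrative.

```python
import math


def codec_training_set(n=16):
    """Build an identity-mapping (encoder) training set.

    Assumption: each of the n codes is a one-hot n-bit pattern.
    """
    patterns = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    # Identity mapping: the target of every pattern is the pattern itself.
    targets = [row[:] for row in patterns]
    return patterns, targets


def tight_hidden_size(n):
    """Smallest hidden width that can give n patterns distinct binary codes."""
    return math.ceil(math.log2(n))


inputs, targets = codec_training_set(16)
hidden = tight_hidden_size(16)
layer_sizes = [hidden, 16]  # written as [4 16] in Tables 6.6 and 6.7
print(layer_sizes)
```

With 16 patterns the tight hidden width is log2 16 = 4, which is the hidden-layer size reported in the tables.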
Simulated Output of Codec

Simulated Values for Codecs

JAVA PLOTS

Codec: [4 16] {Logsig Logsig} Trainrp; Epochs: 96

Codec1: [4 16] {Tansig Logsig} Trainrp; Epochs: 203

Figure 6.11 : CODEC Error Plots

JAVA PLOTS

Codec: [4 16] {Logsig Logsig} Trainrp; Epochs: 95

Codec1: [4 16] {Tansig Logsig} Trainrp; Epochs: 86

Figure 6.12 : CODEC Error Plots

Codec (JAVA): Simulated Values
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6.4 Character Recognition Problem 

A popular type of benchmark test used for neural networks is that of character
recognition. The number of inputs and outputs varies depending on the set of characters
used for recognition and the features selected for input. In this work the letters
A to Z have been used for recognition. In the most difficult case, cursive handwritten
character recognition can be attempted. For these tests, the inputs may be individual pixels,
say 8 x 8, presented to the network in a particular order. The outputs are the binary
representations of the individual characters.

In this problem, all 26 letters have been mapped to 5-bit codes. Each letter has
been drawn on a grid of 8 x 8 pixels, and the pixels have been read out as a
64-bit input word. The words have been read in two different orders, and the order of
reading affects the performance of the network. The results and the plots are shown in
figures 6.13 to 6.16 (MATLAB) and figures 6.17 to 6.19 (JAVA). The architectures found
to give good recognition capabilities are shown in the table below:

MATLAB RESULTS

Table 6.8: Character Recognition Problem

Character Recognition Problem: ANN Architectures used

Network       Architecture  Activation Fn.         Algorithm  Error Goal  Iterations
Charalpha_11  [10 11 5]     Tansig Tansig Logsig   Trainrp    0.0001      100
Charalpha_12  [10 11 5]     Tansig Tansig Logsig   Trainrp    0.001       80
Charalpha_21  [10 11 5]     Logsig Logsig Logsig   Trainrp    0.00001     103


JAVA RESULTS

Table 6.9: Character Recognition Problem

Character Recognition Problem: ANN Architectures used

Network        Architecture  Activation Fn.         Algorithm  Error Goal  Epochs
Charalpha      [10 11 5]     Tansig Tansig Logsig   Trainrp    0.00001
CharalphaCRO1  [10 11 5]     Tansig Tansig Logsig   Trainrp    0.00001     61
CharalphaCRO2  [10 11 5]     Logsig Logsig Logsig   Trainrp    0.00001     2206
For the first JAVA architecture the input data was read in raster-scan order, from top
to bottom. In all the other cases the input data was obtained by reading the pixels in
CRO-scan order, from top to bottom. In MATLAB, reading the letters in raster-scan order
resulted in inferior recognition of noisy input. The networks were trained and tested on
noisy patterns and were found to give good recognition capability for the above architectures.
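
The two reading orders and the 5-bit letter codes can be illustrated with a short sketch. It is written in Python rather than the MATLAB or JAVA of this work and rests on two assumptions that the text does not confirm: that "CRO scan" means reading column by column, alternating top-down and bottom-up, and that the letters are coded by their position in the alphabet (A = 00000 up to Z = 11001). The glyph is a made-up test pattern, not data from this work.

```python
def raster_scan(grid):
    """Flatten the pixel grid row by row, each row left to right."""
    return [pixel for row in grid for pixel in row]


def cro_scan(grid):
    """Flatten column by column, alternating top-down and bottom-up.

    Assumption: this is one possible reading of the 'CRO scan' order.
    """
    rows, cols = len(grid), len(grid[0])
    out = []
    for c in range(cols):
        order = range(rows) if c % 2 == 0 else range(rows - 1, -1, -1)
        out.extend(grid[r][c] for r in order)
    return out


def letter_code(letter):
    """5-bit target for a letter, assuming A=00000 ... Z=11001 (hypothetical)."""
    index = ord(letter.upper()) - ord("A")
    return [(index >> shift) & 1 for shift in range(4, -1, -1)]


# Hypothetical 8 x 8 glyph: a vertical bar in the first column.
glyph = [[1 if c == 0 else 0 for c in range(8)] for r in range(8)]
raster_word = raster_scan(glyph)
cro_word = cro_scan(glyph)
```

Both orders give a 64-bit word with the same pixels. In the raster word the eight set bits are spread eight positions apart, while in the column-wise word they are adjacent, so the same noise pattern disturbs the two encodings differently.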

Conclusion:

(a) The number of hidden layers and units varies with the application, but
typically no more than two hidden layers are required.

(b) For better performance, the net should also be trained on noisy data.

(c) The power of a neural network shows when a pattern that has no output
associated with it is given as an input. In this case the network gives the output
that corresponds to the taught input pattern that is least different from the given
pattern.

(d) The way the pixels are read affects the noise tolerance of the network.

(e) The performance of the JAVA codes was better as far as noisy input was
concerned.

(f) Training the network to a very stringent error tolerance causes over-fitting and
leaves the network no room for recognition of distorted input.

(g) Reading the patterns top-down and then bottom-up (CRO scan) gives better results.
Charalpha [10 11 5] {Tansig Tansig Logsig} Trainrp; Epochs: 100

Figure 6.13 : Character Recognition (Alphabets)
(The network was tested for distorted and noisy data
and was found to have good recognition capabilities)

Figure 6.14 : Character Recognition Error Plots

Charalpha_21 Simulated Output with Noisy Input (* = noise)

Figure 6.16 : Simulated Output (Noisy Data)

Figure 6.18 : Simulated Output (Noisy Data)

Figure 6.19 : Simulated Output (Noisy Data)



6.5 The SinxSiny Problem

This is a problem of function approximation. Sin(x)Sin(y) is a complicated surface,
and capturing it with an artificial neural network requires a very systematic search for the
right network architecture. Care has to be taken with the error tolerance: a very
stringent error requirement may lead to over-fitting, while a very loose one may leave
the network unable to capture the surface well enough to carry out
interpolation. In this work a total of 625 points were generated (x and y varying from
0 to 2 pi); of these, 100 points were used for training and the trained network was used
to simulate the remaining points. The results and plots are shown in figures 6.20 to 6.33
(MATLAB) and figures 6.34 to 6.49 (JAVA). The architectures found to capture the
surface properly are listed below:


MATLAB RESULTS
Table 6.10: SinxSiny Problem

SinxSiny Problem: ANN Architectures used; Error Goal = 0.0001

Network      Architecture   Activation Fn.                Algorithm  Iterations
SinxSiny_n   [6 5 6 1]      Tansig Tansig Tansig Logsig   TrainLM    623
SinxSiny_n1  [15 15 15 1]   Tansig Tansig Tansig Logsig   TrainSCG   635
SinxSiny_n2  [25 25 1]      Tansig Tansig Logsig          TrainSCG
SinxSiny_n3  [20 25 30 1]   Tansig Tansig Tansig Logsig   TrainRP    425
SinxSiny_n4  [35 35 1]      Tansig Tansig Logsig          TrainRP    500
SinxSiny_n5  [15 15 15 1]   Tansig Tansig Tansig Logsig   TrainCGB   1200
SinxSiny_n6  [20 25 30 1]   Tansig Tansig Tansig Logsig   TrainCGP


TrainSCG = Scaled Conjugate Gradient Back-propagation 

TrainCGB = Conjugate Gradient Back-propagation with Powell-Beale restarts 

TrainCGP = Conjugate Gradient Back-propagation with Polak-Ribiere updates 
JAVA RESULTS

Table 6.11: SinxSiny Problem

SinxSiny Problem: ANN Architectures used; Error Goal = 0.0001

Network       Architecture   Activation Fn.                Algorithm  Iterations
SinxSiny_n1   [15 15 15 1]   Tansig Tansig Tansig Tansig   TrainSCG   2767
                             Tansig Tansig Tansig Logsig   TrainSCG   2338
SinxSiny_n2   [25 25 1]      Tansig Tansig Tansig
SinxSiny_n21  [25 25 1]      Tansig Tansig Logsig          TrainSCG   7760
SinxSiny_n3                  Tansig Tansig Tansig Logsig   TrainRP    801
SinxSiny_n4   [35 35 1]      Tansig Tansig Tansig          TrainRP    5000


Conclusion: From the results achieved the following are concluded:

(a) SinxSiny requires a very rigorous error goal.

(b) The number of neurons required is very large.

(c) Only the Levenberg-Marquardt second-order algorithm requires a smaller number of
neurons.

(d) Both Trainrp and Trainscg take more time and more iterations to converge in the
case of JAVA.

(e) The curve fitting was better in the case of JAVA.

(f) The input data set must also include the product term (x*y) to enable the network to
capture the curve properly.
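
The data set for this problem can be sketched as below. The example is in Python, not the MATLAB or JAVA code of this work. It assumes a regular 25 x 25 grid (which gives the 625 points mentioned in this section) and a random choice of the 100 training points; neither detail is stated in the text. Each input row carries the product term x*y alongside x and y, following conclusion (f).

```python
import math
import random


def sinxsiny_dataset(points_per_axis=25):
    """Sample z = sin(x) * sin(y) on a regular grid over [0, 2*pi] x [0, 2*pi].

    Assumption: a 25 x 25 grid; the text only gives the total of 625 points.
    """
    step = 2 * math.pi / (points_per_axis - 1)
    axis = [i * step for i in range(points_per_axis)]
    inputs = [(x, y, x * y) for x in axis for y in axis]
    targets = [math.sin(x) * math.sin(y) for x, y, _ in inputs]
    return inputs, targets


def split(inputs, targets, n_train=100, seed=0):
    """Pick n_train samples at random for training; the rest are held out.

    Assumption: the selection rule for the training points is hypothetical.
    """
    indices = list(range(len(inputs)))
    random.Random(seed).shuffle(indices)

    def pick(chosen):
        return [inputs[i] for i in chosen], [targets[i] for i in chosen]

    return pick(indices[:n_train]), pick(indices[n_train:])


inputs, targets = sinxsiny_dataset()
(train_x, train_z), (test_x, test_z) = split(inputs, targets)
print(len(inputs), len(train_x), len(test_x))  # 625 100 525
```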
Figure 6.21 : SinxSiny Plots

Figure 6.23 : Sin(x)Sin(y) Plots

Figure 6. : Sin(x)Sin(y) Plots

Figure 6. : Sin(x)Sin(y) Plots

Figure 6.33 : Sin(x)Sin(y) Error Plots (MATLAB)

Figure 6. : Sin(x)Sin(y) Error Plots (JAVA)
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6.6 Classification Problem 

In the problem considered, points are to be classified into four classes. The
points are generated using two functions g1 and g2. The signs of g1 and g2 decide the
class to which a point belongs. The two functions are:

gl (X1,X2, X3, X4,X5)= Xj * exp[(;c 3 -X 5 y]- 2 [(xi*x 4 -iyl*X 5 +X 3 -X 4 -0.4 
g2(Xl,X2,X3,X4,X5)= X 2 * X 3 (2 X 4 ^Xj - 1) - *Xil - 1 


The classification has been carried out according to the rules:

If g1 < 0 and g2 < 0 then the class = 00
If g1 < 0 and g2 > 0 then the class = 01
If g1 > 0 and g2 < 0 then the class = 10
If g1 > 0 and g2 > 0 then the class = 11
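
The four rules amount to taking the sign of each function as one bit of the class label. A minimal sketch in Python (not the MATLAB or JAVA code of this work) is given below; it takes already-computed values of the two functions as arguments. The treatment of a value that is exactly zero is an assumption, since the rules only cover strictly negative and strictly positive values, and the function names are illustrative.

```python
def classify(g1, g2):
    """Map the signs of g1 and g2 to the two-bit class label.

    Assumption: a value of exactly 0 is treated as positive.
    """
    first = "1" if g1 >= 0 else "0"
    second = "1" if g2 >= 0 else "0"
    return first + second


def to_targets(label):
    """Two target outputs, one per bit, for a two-neuron output layer."""
    return [int(bit) for bit in label]


print(classify(-0.3, 1.2), to_targets(classify(-0.3, 1.2)))  # 01 [0, 1]
```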

The values of the functions depend on the variables x1 to x5, which vary from 0 to 2 at
a spacing of 0.2/0.3/0.4. A total of 1.6 lakh (about 160,000) points were generated and
classified. A total of 5 - 8 % of the points generated have been used for training and the
rest were used for simulation. The results and plots of classification are shown in
figure 6.50 (MATLAB) and figures 6.51 to 6.53 (JAVA). The architectures that were found
to give convergence for the above problem are:

MATLAB RESULTS

Table 6.12: Classification Problem

Classification Problem: ANN Architectures used; Error Goal = 0.01

Network     Architecture  Activation Fn.                Algorithm  Iterations
Class21rp   61 59 61 2    Tansig Tansig Tansig Logsig   Trainrp    573
Class31rp   19 17 19 2    Tansig Tansig Tansig Logsig   Trainrp    >30000
Class41rp   37 35 37 2    Tansig Tansig Tansig Logsig   Trainrp    113
Class21scg  61 59 61 2    Tansig Tansig Tansig Logsig   Trainscg
Class31scg  21 19 21 2    Tansig Tansig Tansig Logsig   Trainscg   1332
Class41scg  15 15 15 2    Tansig Tansig Tansig Logsig   Trainscg
Class42scg  61 49 45 2    Tansig Tansig Tansig Logsig   Trainscg   101
JAVA RESULTS

Table 6.13: Classification Problem

Classification Problem: ANN Architectures used; Error Goal = 0.01

Network   Architecture  Activation Fn.                Algorithm  Iterations
Class212  61 59 61 2    Tansig Tansig Tansig Logsig   Trainrp    1814
Class31   19 17 19 2    Tansig Tansig Tansig Logsig   Trainrp    3000
Class41   37 35 37 2    Tansig Tansig Tansig Logsig   Trainrp    393
Class212  61 59 61 2    Tansig Tansig Tansig Tansig   Trainscg
Class31   21 19 21 2    Tansig Tansig Tansig Tansig   Trainscg   2500
Class42   61 49 45 2    Tansig Tansig Tansig Tansig   Trainscg   646
Class212  61 59 61 2    Tansig Tansig Tansig Logsig   Trainscg   2761
Class31   21 19 21 2    Tansig Tansig Tansig Logsig   Trainscg   1645
          15 15 15 2    Tansig Tansig Tansig Logsig   Trainscg   961



On simulation of the trained networks it was found that the error in classification was
around 6 - 8 % with the Trainrp algorithm and around 4 - 6 % with Trainscg
for MATLAB. In the case of JAVA the error performance was slightly better.

Conclusion: From the above results it is concluded that:

(a) The Trainrp algorithm gave better results in the case of JAVA.

(b) The minimum error in classification was achieved when Trainscg was used.

(c) The use of points near the class boundaries for training improved the network
performance.

(d) Around 7 to 10 % of the points are required for training.

(e) The network required at least three hidden layers.

(f) The number of neurons required was large and hence training takes
considerable time.
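
The size of the point set quoted in this section can be checked by counting grid points. The sketch below is in Python and works in integer tenths to avoid floating-point drift; the constant and function names are illustrative. It only counts the points of a full five-dimensional grid for each of the three spacings; how the coarser spacings were actually used in this work is not stated.

```python
GRID_MAX_TENTHS = 20  # the variables run from 0 to 2.0
N_VARIABLES = 5       # x1 ... x5


def values_per_axis(spacing_tenths):
    """Number of grid values 0, s, 2s, ... that do not exceed 2.0."""
    return GRID_MAX_TENTHS // spacing_tenths + 1


def grid_size(spacing_tenths):
    """Total number of points on the 5-dimensional grid for one spacing."""
    return values_per_axis(spacing_tenths) ** N_VARIABLES


for spacing_tenths in (2, 3, 4):
    print(spacing_tenths / 10, values_per_axis(spacing_tenths), grid_size(spacing_tenths))
```

A spacing of 0.2 gives 11 values per variable and 11^5 = 161,051 points, which agrees with the figure of about 1.6 lakh. Spacings of 0.3 and 0.4 give 16,807 and 7,776 points respectively.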
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6.7 Exponential Sine Curve Problem 

In this problem time series analysis was carried out. This is a problem of function
approximation. The function is an exponentially increasing sine curve. The aim was to
build a network that could predict the future points of the curve, given the previous ones.
The problem was solved using three networks: one each for the sinusoidal and exponential
curves and a third for the final output (sin(x)exp(x)). In the individual networks
some of the points were used for training and the remaining points were predicted.
Both these networks used time series prediction. The training set was built so that every
three consecutive input-output sets were used for prediction of the fourth output. After
training, the individual networks were used to predict the remainder of the curve. The third
network was employed to carry out the multiplication. The inputs to this
network are the predicted sine (sin(x)) and exponential (exp(x)) points and the output is
sin(x)*exp(x). Approximating a multiplication of this kind required a very
thorough search for an appropriate network. The results and the plots are shown in figures
6.54 to 6.57. The network architectures that were found to give good approximation are:


MATLAB RESULTS

Table 6.14: Sin(x)Exp(x) Approximation

Sinexp Function Approximation Problem: ANN Architectures used

Network     Architecture   Activation Fn.                Algorithm  Epochs  Error Goal
Exp_fn      Single neuron  Purelin                       Trainrp    229     1e-7
Sine_fn     Single neuron  Purelin                       Trainrp    800     1e-7
Sinexp_fn   [21 20 21 1]   Tansig Tansig Tansig Tansig   Trainrp    3390    7e-7
Exp_fn1     Single neuron  Purelin                       Trainscg   17      1e-7
Sine_fn1    Single neuron  Purelin                       Trainscg   6       1e-7
Sinexp_fn1  [21 20 21 1]   Tansig Tansig Tansig Tansig   Trainrp    28538   7e-7
JAVA RESULTS

Table 6.15: Sin(x)Exp(x) Approximation

Sinexp Function Approximation Problem: ANN Architectures used

Network    Architecture   Activation Fn.                Algorithm  Iterations  Error Goal
Exp_fn     Single neuron  Purelin
Sine_fn    Single neuron  Purelin                       Trainrp    347
Sinexp_fn  [21 20 21 1]   Tansig Tansig Tansig Tansig   Trainrp    97466       1e-6



Conclusion: From the above results it can be concluded that:

(a) Only Trainrp and Trainscg could perform the time series prediction of the sine and
exponential curves.

(b) The error goal reached by the MATLAB networks was more rigorous. However,
even with inferior error performance JAVA predicted the curve well.

(c) The JAVA codes took a considerably long time to converge.

(d) Time series prediction for the sine and exponential curves requires only a single
neuron in the output layer.

(e) To perform multiplication, the neural network requires a very large
architecture, with as many as three hidden layers and about sixty hidden neurons in total.

(f) Time series prediction using Trainscg does not work in JAVA.
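
The construction of the time series training sets described in this section can be sketched as follows. The example is in Python, not the MATLAB or JAVA code of this work; the sampling step of 0.1, the number of samples and all names are hypothetical, since the text does not give them.

```python
import math


def sliding_windows(series, window=3):
    """Turn a series into (previous `window` values -> next value) pairs."""
    inputs, targets = [], []
    for i in range(len(series) - window):
        inputs.append(series[i:i + window])
        targets.append(series[i + window])
    return inputs, targets


# Hypothetical sampling: 50 points with a step of 0.1.
STEP = 0.1
xs = [STEP * k for k in range(50)]
sine_inputs, sine_targets = sliding_windows([math.sin(x) for x in xs])
exp_inputs, exp_targets = sliding_windows([math.exp(x) for x in xs])

# Training pairs for the third (multiplier) network:
# (sin(x), exp(x)) -> sin(x) * exp(x).
product_inputs = [(math.sin(x), math.exp(x)) for x in xs]
product_targets = [s * e for s, e in product_inputs]
```

For uniformly spaced samples the next sine value is an exact linear combination of the two previous ones, sin(x + h) = 2 cos(h) sin(x) - sin(x - h), and the next exponential value is a constant multiple of the previous one. This is consistent with a single linear (Purelin) neuron being sufficient for the two prediction networks, while the multiplication needs a much larger non-linear network.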
Extrapolation of Sinusoidal Curve
Sine [2] TrainRP, Epochs = 800

Figure 6.55 : Extrapolation Curves

Sine-Exponential Plot

Exp_fn1 Error Plot

Sine_fn1 Error Plot

Figure 6.56a : Function Approximation (y = sin(x)exp(x))

Extrapolation of Exponential Curve
Exp [2] TrainSCG, Epochs = 17

JAVA PLOT

Extrapolation of Exponential Curve
Exp [1] {Purelin} TrainRP, Epochs: 100000

Extrapolation of Sinusoidal Curve
Sin [1] {Purelin} TrainRP, Epochs: 100000

Actual = Bold; Simulated = Dotted

Sinexp_fn1 [21 20 21 1] {Tansig Tansig Tansig Tansig}
TrainRP, Epochs: 97466

Figure 6.57 : Function Approximation (y = sin(x)*exp(x))
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